SQL, Python & More for DuckDB - Page 3

‌

Kernel Density Estimation - Epanechnikov KernelSQL

KDE estimates the probability distribution of a random variable. The bandwidth parameter controls the width of the kernel, influencing how smooth or detailed the estimated density curve is. A smaller bandwidth results in a more detailed estimation, while a larger bandwidth produces a smoother curve.

Execute this SQL

CREATE OR REPLACE MACRO KDE_EPANECH(data, varnum, bandwidth, bin_count := 30) AS TABLE 
WITH hist AS (
	FROM histogram_values(data, varnum, bin_count := bin_count)
)
SELECT hist.bin, k.kernel_value
FROM hist, LATERAL (
    SELECT 100 * AVG(
    IF(abs(hist.bin - varnum) / bandwidth  < 1,
    0.75 * (1 - POW(abs(hist.bin - varnum) / bandwidth, 2)) / bandwidth,
    0)) AS kernel_value
	FROM query_table(data)
) k
ORDER BY hist.bin ;

-- Following David Scott's rule, here is an estimate for bandwidth:
CREATE OR REPLACE MACRO KDE_BANDWIDTH(data, varnum) AS (
	FROM query_table(data)
	SELECT 1.06 * stddev(varnum) * pow(count(*), -1/5)
);

-- Usage
SET VARIABLE bandwidth = (SELECT KDE_BANDWIDTH(mydata, myvarnum)) ;

FROM KDE_EPANECH(mydata, myvarnum, getvariable('bandwidth')) ;

-- Inspiration and illustration: https://observablehq.com/@d3/kernel-density-estimation
Copy code

Éric Mauvière

Created 10/10/24

‌

Load content from Strapi CMS REST API to Parquet fileSQL

A nice trick to load data from Strapi CMS. The Api Token can be obtained in the Settings menu of Strapi. A nice way to let users maintain reference data using the CMS and be able to use it directly in DuckDB. Should work for both Strapi self-hosted and cloud.

Execute this SQL

INSTALL httpfs;
LOAD httpfs;

CREATE SECRET http (
    TYPE HTTP,
    EXTRA_HTTP_HEADERS MAP {
        'Authorization': 'Bearer [Api Token]'
    }
); 

-- Replace strapi.mydomain.com with your Strapi URL and replace `pets` with your content type
COPY (SELECT unnest(data, recursive:= true) FROM read_json_auto('https://strapi.mydomain.com/api/pets')) TO 'pets.parquet';
Copy code

Created 10/09/24

‌

Reading a fixed-width file in DuckDBSQL

Fixed-width files can be little difficult but IF you consider each line of data as a string which you can attack with duckdb and substr() its not that difficult ;)

Execute this SQL

CREATE OR REPLACE TABLE example_table AS
SELECT 
    CAST(substr(line, 1, 4) AS INTEGER) AS activity_year,
    CAST(substr(line, 5, 10) AS VARCHAR(10)) AS lei_or_respondent_id,
    CAST(substr(line, 15, 1) AS CHAR(1)) AS agency_code,
    CAST(substr(line, 16, 1) AS CHAR(1)) AS loan_type,
    CAST(substr(line, 17, 1) AS CHAR(1)) AS loan_purpose,
    CAST(substr(line, 18, 1) AS CHAR(1)) AS occupancy_type
FROM
    (SELECT column0 AS line FROM read_csv('data.txt', AUTO_DETECT=TRUE, skip=1));
Copy code

Chetan Amrao

Edited 10/02/24

‌

Query an Authenticated API EndpointSQL

Hit an endpoint that needs an API key. In this case, Stripe. The demo uses a public test stripe API key from the Stripe docs

Execute this SQL

CREATE SECRET http (
    TYPE HTTP,
    EXTRA_HTTP_HEADERS MAP {
        'Authorization': 'Bearer sk_test_VePHdqKTYQjKNInc7u56JBrQ'
    }
); 

select unnest(data) as customers 
from read_json('https://api.stripe.com/v1/customers');
Copy code

Archie Sarre Wood

Created 09/25/24

‌

Discretize numeric column following specified thresholdsSQL

discretize() converts a numeric column into discrete ordered ids, taking into account a list of thresholds.

Execute this SQL

CREATE OR REPLACE MACRO discretize(v, l) AS (
	WITH t1 AS (
		SELECT unnest(list_distinct(l)) as j
	), t2 AS (
		SELECT COUNT(*) + 1 c FROM t1 
	  WHERE try_cast(j AS float) <= v
	) FROM t2 
	SELECT IF(v IS NULL, NULL, c) 
) ;

--Usage
FROM 'https://raw.githubusercontent.com/thewiremonkey/factbook.csv/master/data/c2127.csv'
SELECT name, value, discretize(value, [2,3,4,5]) AS class ;
Copy code

Éric Mauvière

Created 09/18/24

‌

KMeans on one dimensional data with recursive CTESQL

Compute kmeans thresholds from a table with 2 columns : id (unique) and numeric. Outputs a list. Easy to extend to 2 dimensions data (x,y).

Execute this SQL

CREATE FUNCTION kmeans(tname, idcol_name, numcol_name, bins:=5, maxiter:=100) AS (
WITH RECURSIVE clusters(iter, cid, x) AS ( 
	WITH t1 AS (FROM query_table(tname) SELECT idcol_name AS id, numcol_name AS x)
	(SELECT 0, id, x FROM t1 LIMIT bins-1) 
	UNION ALL 
	SELECT iter + 1, cid, avg(px) FROM ( 
		SELECT iter, cid, p.x as px, 
		rank() OVER (PARTITION BY p.id ORDER BY (p.x-c.x)^2, c.x^2) r
		FROM t1 p, clusters c
	) x 
	WHERE x.r = 1 and iter < maxiter 
	GROUP BY ALL
)
SELECT list(x) FROM 
(FROM clusters WHERE iter = maxiter ORDER BY x)
) ;
Copy code

Éric Mauvière

Created 09/17/24

‌

Add a unique column (key) to an existing table in DuckDBSQL

We can use DuckDB's `SEQUENCE` to add a unique key to a table.Many source data files (such as CSV files) do not include a unique key.

Execute this SQL

CREATE TABLE tbl (s VARCHAR);

INSERT INTO tbl VALUES ( 'hello'),('world');

CREATE SEQUENCE id_sequence;

ALTER TABLE tbl ADD COLUMN id INTEGER DEFAULT nextval ('id_sequence');

SELECT * FROM tbl;
Copy code

colorcrow

Created 08/31/24

‌

Query from Google SheetsSQL

Sometimes you have data over in Google Sheets and you want quickly and easily load it into DuckDB. With read_csv and the Google Sheets 'export to csv', its just one line of code.

Execute this SQL

-- you will need to create a sharelink for your sheet.
-- note the tab id (gid) and sheet id (in the url)

from read_csv('https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={tab_id}')
Copy code

Jacob Matson

Created 08/28/24