Discretize numeric column following specified thresholdsSQL

discretize() converts a numeric column into discrete ordered ids, taking into account a list of thresholds.

Execute this SQL

CREATE OR REPLACE MACRO discretize(v, l) AS (
	WITH t1 AS (
		SELECT unnest(list_distinct(l)) as j
	), t2 AS (
		SELECT COUNT(*) + 1 c FROM t1 
	  WHERE try_cast(j AS float) <= v
	) FROM t2 
	SELECT IF(v IS NULL, NULL, c) 
) ;

--Usage
FROM 'https://raw.githubusercontent.com/thewiremonkey/factbook.csv/master/data/c2127.csv'
SELECT name, value, discretize(value, [2,3,4,5]) AS class ;

Copy code

Éric Mauvière

Expand

Share link


KMeans on one dimensional data with recursive CTESQL

Compute kmeans thresholds from a table with 2 columns : id (unique) and numeric. Outputs a list. Easy to extend to 2 dimensions data (x,y).

Execute this SQL

CREATE FUNCTION kmeans(tname, idcol_name, numcol_name, bins:=5, maxiter:=100) AS (
WITH RECURSIVE clusters(iter, cid, x) AS ( 
	WITH t1 AS (FROM query_table(tname) SELECT idcol_name AS id, numcol_name AS x)
	(SELECT 0, id, x FROM t1 LIMIT bins-1) 
	UNION ALL 
	SELECT iter + 1, cid, avg(px) FROM ( 
		SELECT iter, cid, p.x as px, 
		rank() OVER (PARTITION BY p.id ORDER BY (p.x-c.x)^2, c.x^2) r
		FROM t1 p, clusters c
	) x 
	WHERE x.r = 1 and iter < maxiter 
	GROUP BY ALL
)
SELECT list(x) FROM 
(FROM clusters WHERE iter = maxiter ORDER BY x)
) ;

Copy code

Éric Mauvière

Expand

Share link


Add a unique column (key) to an existing table in DuckDBSQL

We can use DuckDB's `SEQUENCE` to add a unique key to a table.Many source data files (such as CSV files) do not include a unique key.

Execute this SQL

CREATE TABLE tbl (s VARCHAR);

INSERT INTO tbl VALUES ( 'hello'),('world');

CREATE SEQUENCE id_sequence;

ALTER TABLE tbl ADD COLUMN id INTEGER DEFAULT nextval ('id_sequence');

SELECT * FROM tbl;

Copy code

colorcrow

Expand

Share link


Query from Google SheetsSQL

Sometimes you have data over in Google Sheets and you want quickly and easily load it into DuckDB. With read_csv and the Google Sheets 'export to csv', its just one line of code.

Execute this SQL

-- you will need to create a sharelink for your sheet.
-- note the tab id (gid) and sheet id (in the url)

from read_csv('https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={tab_id}')

Copy code

Jacob Matson

Expand

Share link


Emulate the Initcap functionSQL

In many database systems, there is a SQL function called something like INITCAP which capitalizes the first letter of all the words in a text string. Unfortunately, DuckDB doesn’t have this built-in, so let’s see if we can emulate it using function chaining and list comprehension.

Execute this SQL

SELECT ([upper (x[1])||x[2:] 
for x in 
(
'the quick brown fox jumped over the lazy dog'
).string_split(' ')]).list_aggr('string_agg',' ') 

Copy code

Thomas Reid

Expand

Share link


Updating CSV files in-placeSQL

DuckDB allows updating CSV files in-place.

Execute this SQL

COPY (FROM VALUES ('foo', 10/9), ('bar', 50/7), ('qux', 9/4) t(s, x))
TO 'example.csv';

COPY (SELECT s FROM 'example.csv') TO 'example.csv';

Copy code

Gábor Szárnyas

Expand

Share link


Specifying types in the CSV loaderSQL

We can specify types for the `read_csv` function on a per-column basis.

Execute this SQL

COPY (FROM VALUES ('foo', 10/9), ('bar', 50/7), ('qux', 9/4) t(s, x))
TO 'example.csv';

CREATE OR REPLACE TABLE example AS
    FROM read_csv('example.csv', types = {'x': 'DECIMAL(15, 3)'});

Copy code

Gábor Szárnyas

Expand

Share link


Deterministically shuffling dataSQL

Sometimes it's beneficial to shuffle data sets in a deterministic, repeatable way. We can achieve this by using the `hash` function on the `rowid` pseudocolumn.

Execute this SQL

COPY (FROM VALUES ('foo', 10/9), ('bar', 50/7), ('qux', 9/4) t(s, x))
TO 'example.csv';

CREATE OR REPLACE TABLE example AS FROM 'example.csv';
FROM example ORDER BY hash(rowid + 42);

Copy code

Gábor Szárnyas

Expand

Share link