SQL, Python & More for DuckDB - Page 2

‌

Unnest JSON Array into Rows (pseudo-json_each)SQL

Editor's note: as APIs and applications increasingly represent tabular data in JSON format, we sometimes want the comfort of seeing it as rows, queryable in SQL. DuckDB has a great extension for parsing JSON, including the ability to convert into a DuckDB list which can then be unnested into rows.

Execute this SQL

select 
    -- Split DuckDB list into multiple rows
    unnest(
        -- Convert JSON array into DuckDB list of JSON elements
        from_json(
            json('[{"col1":1, "col2":2}, {"col1":10, "col2":20}]'),
            '["JSON"]'
        )
    );
/* 
Returns:
{"col1":1,"col2":2}
{"col1":10,"col2":20}
*/
Copy code

Alex Monahan

Edited 02/23/24

‌

Query an Authenticated API EndpointSQL

Hit an endpoint that needs an API key. In this case, Stripe. The demo uses a public test stripe API key from the Stripe docs

Execute this SQL

CREATE SECRET http (
    TYPE HTTP,
    EXTRA_HTTP_HEADERS MAP {
        'Authorization': 'Bearer sk_test_VePHdqKTYQjKNInc7u56JBrQ'
    }
); 

select unnest(data) as customers 
from read_json('https://api.stripe.com/v1/customers');
Copy code

Archie Sarre Wood

Created 09/25/24

‌

KMeans on one dimensional data with recursive CTESQL

Compute kmeans thresholds from a table with 2 columns : id (unique) and numeric. Outputs a list. Easy to extend to 2 dimensions data (x,y).

Execute this SQL

CREATE FUNCTION kmeans(tname, idcol_name, numcol_name, bins:=5, maxiter:=100) AS (
WITH RECURSIVE clusters(iter, cid, x) AS ( 
	WITH t1 AS (FROM query_table(tname) SELECT idcol_name AS id, numcol_name AS x)
	(SELECT 0, id, x FROM t1 LIMIT bins-1) 
	UNION ALL 
	SELECT iter + 1, cid, avg(px) FROM ( 
		SELECT iter, cid, p.x as px, 
		rank() OVER (PARTITION BY p.id ORDER BY (p.x-c.x)^2, c.x^2) r
		FROM t1 p, clusters c
	) x 
	WHERE x.r = 1 and iter < maxiter 
	GROUP BY ALL
)
SELECT list(x) FROM 
(FROM clusters WHERE iter = maxiter ORDER BY x)
) ;
Copy code

Éric Mauvière

Created 09/17/24

‌

build-your-own SUMMARIZE with columns(), unpivot, and struct.SQL

Editor's note: Although SUMMARIZE() is fully supported in DuckDB, Hamilton shows you how to build this capability yourself using more primitive functions.

Execute this SQL

-- C = number of columns in TBL, S = number of aggregates
-- ------------------------------------------------------
-- (1) use columns(*) to produce S aggregates
-- for each column as a single struct,
-- producing 1 struct row and C columns
with aggregate as (
  from sf311 select 
    {
      name: first(alias(columns(*))),
      type: first(typeof(columns(*))),
      max: max(columns(*))::varchar,
      min: min(columns(*))::varchar,
      approx_unique: approx_count_distinct(columns(*)),
      nulls: count(*) - count(columns(*))
    }
),
-- (2) unpivot to get a result with C rows and 1 column of structs
columns as (unpivot aggregate on columns(*))
-- (3) explode the struct to get a result with C rows and S columns
select value.* from columns;
Copy code

Hamilton Ulmer

Edited 02/23/24

‌

Parquet to ExcelSQL

Editor's note: although not very intuitive, format for MS Excel XLSX files is provided by the spatial extension. While other snippets show you how to read Excel files, this shows you how to write data from a named Parquet file (or any DuckDB resultset) to Excel.

Execute this SQL

INSTALL spatial;
LOAD spatial;
COPY (SELECT * FROM 'source.parquet') TO 'target.XLSX' WITH (FORMAT GDAL, DRIVER 'XLSX');
Copy code

Simon Müller

Edited 02/23/24

‌

Download CSV and convert to ParquetSQL

Execute this SQL

COPY (SELECT * FROM read_csv_auto(
'http://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv'))
TO 'movies.parquet' (FORMAT 'parquet');
Copy code

Ryan Boyd

Created 04/10/23

‌

Query Yahoo Finance data with Scrooge extensionSQL

Editor's note: The Scrooge McDuck extension provides common aggregation functions and data scanners used for financial data. This example grabs stock quotes (or the S&P 500 in this case) on specific dates for analysis in DuckDB.

Execute this SQL

-- Install httpfs extension
INSTALL httpfs;
LOAD httpfs;
-- Install Scrooge extension https://github.com/pdet/Scrooge-McDuck
-- NOTE: You need to start DuckDB with `-unsigned` flag to authorize to install & load of 3rd party extension
SET custom_extension_repository='scrooge-duckdb.s3.us-west-2.amazonaws.com/scrooge/s3_deploy';
INSTALL scrooge;
LOAD scrooge;
-- Example of query 
FROM yahoo_finance("^GSPC", "2023-02-01", "2023-02-04", "1d");
Copy code

Mehdi Ouazza

Edited 02/23/24

‌

Dynamic SQL in DuckDB CLISQL

Editor's note: this gets a little recursive for all those SQL nerds out there. It shows you how to use SQL to generate SQL. It also shows you some of the dot commands for interacting with the DuckDB CLI: changing the output format and writing SQL to a file which is then read in for execution.

Execute this SQL

-- Use SQL to generate SQL statements.
-- Write them to a file, then execute that file.
-- This uses DuckDB CLI-specific features, so it only works in the CLI.
.mode list
.header off
.once my_dynamic_sql.sql
select 'CREATE TABLE t' || i || ' AS SELECT ' || i || ' AS my_column;' from generate_series(5) x(i);
.read my_dynamic_sql.sql

select table_name from duckdb_tables(); -- Check our work!
Copy code

Alex Monahan

Edited 02/23/24

Unnest JSON Array into Rows (pseudo-json_each)SQL

Execute this SQL

Query an Authenticated API EndpointSQL

Execute this SQL

KMeans on one dimensional data with recursive CTESQL

Execute this SQL

build-your-own SUMMARIZE with columns(*), unpivot, and struct.*SQL

Execute this SQL

Parquet to ExcelSQL

Execute this SQL

Download CSV and convert to ParquetSQL

Execute this SQL

Query Yahoo Finance data with Scrooge extensionSQL

Execute this SQL

Dynamic SQL in DuckDB CLISQL

Execute this SQL

build-your-own SUMMARIZE with columns(), unpivot, and struct.SQL