Unnest JSON Array into Rows (pseudo-json_each)SQL
Editor's note: as APIs and applications increasingly represent tabular data in JSON format, we sometimes want the comfort of seeing it as rows, queryable in SQL. DuckDB has a great extension for parsing JSON, including the ability to convert into a DuckDB list which can then be unnested into rows.
Execute this SQL
select -- Split DuckDB list into multiple rows unnest( -- Convert JSON array into DuckDB list of JSON elements from_json( json('[{"col1":1, "col2":2}, {"col1":10, "col2":20}]'), '["JSON"]' ) ); /* Returns: {"col1":1,"col2":2} {"col1":10,"col2":20} */Copy code
Expand
Share link
Query an Authenticated API EndpointSQL
Hit an endpoint that needs an API key. In this case, Stripe. The demo uses a public test stripe API key from the Stripe docs
Execute this SQL
CREATE SECRET http ( TYPE HTTP, EXTRA_HTTP_HEADERS MAP { 'Authorization': 'Bearer sk_test_VePHdqKTYQjKNInc7u56JBrQ' } ); select unnest(data) as customers from read_json('https://api.stripe.com/v1/customers');Copy code
Expand
Share link
Parquet to ExcelSQL
Editor's note: although not very intuitive, format for MS Excel XLSX files is provided by the spatial extension. While other snippets show you how to read Excel files, this shows you how to write data from a named Parquet file (or any DuckDB resultset) to Excel.
Execute this SQL
INSTALL spatial; LOAD spatial; COPY (SELECT * FROM 'source.parquet') TO 'target.XLSX' WITH (FORMAT GDAL, DRIVER 'XLSX');Copy code
Expand
Share link
KMeans on one dimensional data with recursive CTESQL
Compute kmeans thresholds from a table with 2 columns : id (unique) and numeric. Outputs a list. Easy to extend to 2 dimensions data (x,y).
Execute this SQL
CREATE FUNCTION kmeans(tname, idcol_name, numcol_name, bins:=5, maxiter:=100) AS ( WITH RECURSIVE clusters(iter, cid, x) AS ( WITH t1 AS (FROM query_table(tname) SELECT idcol_name AS id, numcol_name AS x) (SELECT 0, id, x FROM t1 LIMIT bins-1) UNION ALL SELECT iter + 1, cid, avg(px) FROM ( SELECT iter, cid, p.x as px, rank() OVER (PARTITION BY p.id ORDER BY (p.x-c.x)^2, c.x^2) r FROM t1 p, clusters c ) x WHERE x.r = 1 and iter < maxiter GROUP BY ALL ) SELECT list(x) FROM (FROM clusters WHERE iter = maxiter ORDER BY x) ) ;Copy code
Expand
Share link
build-your-own SUMMARIZE with columns(*), unpivot, and struct.*SQL
Editor's note: Although SUMMARIZE() is fully supported in DuckDB, Hamilton shows you how to build this capability yourself using more primitive functions.
Execute this SQL
-- C = number of columns in TBL, S = number of aggregates -- ------------------------------------------------------ -- (1) use columns(*) to produce S aggregates -- for each column as a single struct, -- producing 1 struct row and C columns with aggregate as ( from sf311 select { name: first(alias(columns(*))), type: first(typeof(columns(*))), max: max(columns(*))::varchar, min: min(columns(*))::varchar, approx_unique: approx_count_distinct(columns(*)), nulls: count(*) - count(columns(*)) } ), -- (2) unpivot to get a result with C rows and 1 column of structs columns as (unpivot aggregate on columns(*)) -- (3) explode the struct to get a result with C rows and S columns select value.* from columns;Copy code
Expand
Share link
Execute this SQL
COPY (SELECT * FROM read_csv_auto( 'http://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv')) TO 'movies.parquet' (FORMAT 'parquet');Copy code
Copy code
Expand
Share link
Execute this SQL
-- create macro CREATE OR REPLACE MACRO udf_products_in_year (v_year, v_category) AS TABLE SELECT name, category, created_at FROM products WHERE category = v_category AND year(created_at) = v_year; -- select using the macro as you would do from a table SELECT * FROM udf_products_in_year (2020, 'Home and Garden'); | Copper Light | Home and Garden | 2020-04-05 00:00:00.000 | | Pink Armchair | Home and Garden | 2020-06-23 00:00:00.000 | -- input ddl and data CREATE TABLE products ( name varchar, category varchar, created_at timestamp ); INSERT INTO products VALUES ('Cream Sofa', 'Home and Garden', '2019-03-14'), ('Copper Light', 'Home and Garden', '2020-04-05'), ('Pink Armchair', 'Home and Garden', '2020-06-23');Copy code
Copy code
Expand
Share link
Query Yahoo Finance data with Scrooge extensionSQL
Editor's note: The Scrooge McDuck extension provides common aggregation functions and data scanners used for financial data. This example grabs stock quotes (or the S&P 500 in this case) on specific dates for analysis in DuckDB.
Execute this SQL
-- Install httpfs extension INSTALL httpfs; LOAD httpfs; -- Install Scrooge extension https://github.com/pdet/Scrooge-McDuck -- NOTE: You need to start DuckDB with `-unsigned` flag to authorize to install & load of 3rd party extension SET custom_extension_repository='scrooge-duckdb.s3.us-west-2.amazonaws.com/scrooge/s3_deploy'; INSTALL scrooge; LOAD scrooge; -- Example of query FROM yahoo_finance("^GSPC", "2023-02-01", "2023-02-04", "1d");Copy code
Expand
Share link