KMeans on one dimensional data with recursive CTE

Compute kmeans thresholds from a table with 2 columns : id (unique) and numeric. Outputs a list. Easy to extend to 2 dimensions data (x,y).

Execute this SQL

CREATE FUNCTION kmeans(tname, idcol_name, numcol_name, bins:=5, maxiter:=100) AS (
WITH RECURSIVE clusters(iter, cid, x) AS ( 
	WITH t1 AS (FROM query_table(tname) SELECT idcol_name AS id, numcol_name AS x)
	(SELECT 0, id, x FROM t1 LIMIT bins-1) 
	UNION ALL 
	SELECT iter + 1, cid, avg(px) FROM ( 
		SELECT iter, cid, p.x as px, 
		rank() OVER (PARTITION BY p.id ORDER BY (p.x-c.x)^2, c.x^2) r
		FROM t1 p, clusters c
	) x 
	WHERE x.r = 1 and iter < maxiter 
	GROUP BY ALL
)
SELECT list(x) FROM 
(FROM clusters WHERE iter = maxiter ORDER BY x)
) ;

Copy code

Éric Mauvière

Expand

Share link


Query Yahoo Finance data with Scrooge extension

Editor's note: The Scrooge McDuck extension provides common aggregation functions and data scanners used for financial data. This example grabs stock quotes (or the S&P 500 in this case) on specific dates for analysis in DuckDB.

Execute this SQL

-- Install httpfs extension
INSTALL httpfs;
LOAD httpfs;
-- Install Scrooge extension https://github.com/pdet/Scrooge-McDuck
-- NOTE: You need to start DuckDB with `-unsigned` flag to authorize to install & load of 3rd party extension
SET custom_extension_repository='scrooge-duckdb.s3.us-west-2.amazonaws.com/scrooge/s3_deploy';
INSTALL scrooge;
LOAD scrooge;
-- Example of query 
FROM yahoo_finance("^GSPC", "2023-02-01", "2023-02-04", "1d");

Copy code

Mehdi Ouazza

Expand

Share link


Create partitioned Parquet files from a remote CSV source

Editor's note: DuckDB can create partitioned Parquet files - allowing you to store your data in partitions (eg orders for specific dates, traffic from specific IPs, etc) based on predictable filenames. This allows for more performant queries from cloud storage as only the needed files are retrieved.

Execute this SQL

-- Read from a remote CSV file, and write partitioned Parquet files to local target
-- Queries like this are commonly used in Data Lakes
COPY (SELECT cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region from read_csv_auto('https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.csv')) TO '/tmp/ip-ranges' (FORMAT PARQUET, PARTITION_BY cloud_provider);

Copy code

Tobias Müller

Expand

Share link


Combine several parquet files into one and compress with zstd

Editor's note: another great example of using DuckDB's wide data format support to merge/combine multiple Parquet files. Parth also kindly shows you how to compress the resulting Parquet file with the zstd codec. DuckDB also supports gzip and snappy compression codecs.

Execute this Bash

duckdb -c "COPY (SELECT * FROM '~/Binance_Spot_Data/*.parquet') TO 'binance.parquet' (FORMAT 'PARQUET', CODEC 'zstd')"

Copy code

Parth Patil

Expand

Share link


Put null values last when sorting (like Excel or Postgres)

Editor's note: DuckDB enables you to configure whether NULL values are returned first or last in result sets by default. You can also specify it per query using NULLS LAST in the query ORDER BY clause. Note that NULLS LAST is now the default with 0.8.0+.

Execute this SQL

PRAGMA default_null_order='NULLS LAST';

Copy code

Matt Holden

Expand

Share link


Query S3 Access Logs

Editor's note: Want to read log files with DuckDB? You can use the read_csv function and custom date/time + regex parsing to do it. To make the data more useful, you can specifically CAST some of the values as numerical types. This snippet also shows CASE WHEN ELSE statements in action.

Execute this SQL


/* 
Background: If you have S3 Access Logging enabled on one of your S3 buckets, you'll have some useful information about requests to your bucket. Unfortunately, it's in a semistructured format that can be difficult to parse. This SQL query will can help in this manner, both pulling out individual fields and coersing them to native data types.

Usage: you'll want to search for the strings <bucket> and <prefix>, and insert the S3 bucket where your access logs are being delivered. Use (or delete) <prefix> to filter to a subset of your logs.

Also, these commented out configuration settings you can either run  yourself in the REPL and source this file using `.read parse_s3_access_logs.sql`, or you can uncomment them and supply values for yourself.
*/

-- install https;
-- load https;
-- SET s3_region='us-west-2';
-- SET s3_access_key_id='';
-- SET s3_secret_access_key='';

WITH parsed_logs AS (
    SELECT
        regexp_extract(col1, '^([0-9a-zA-Z]+)\s+([a-z0-9.\-]+)\s+\[([0-9/A-Za-z: +]+)\] ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ("[^"]*"|-) ([^ ]+) ([^ ]+) (\d+|-) (\d+|-) (\d+|-) (\d+|-) ("[^"]*"|-) ("[^"]*"|-) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)(.*)$',
        ['bucket_owner', 'bucket', 'timestamp', 'remote_ip', 'request', 'request_id', 'operation', 's3_key', 'request_uri', 'http_status', 's3_errorcode', 'bytes_sent','object_size', 'total_time', 'turn_around_time', 'referer', 'user_agent', 'version_id', 'host_id', 'sigver', 'cyphersuite', 'auth_type', 'host_header', 'tls_version', 'access_point_arn', 'acl_required', 'extra']
    ) AS log_struct
    FROM
        -- Trick the CSV reader into reading as a single column
        read_csv(
            's3://<bucket>/<prefix>/*',
            columns={'col1': 'VARCHAR'},
            -- Use a *hopefully* nonsensical deliminator, so no ',' chars screw us up
            delim='\0'
        )
)
SELECT
        -- Grab everything from the struct that we want as strings, exclude stuff we'll coersce to diff types
        log_struct.* exclude (timestamp, bytes_sent, object_size, total_time, turn_around_time),
        strptime(log_struct.timestamp, '%d/%b/%Y:%H:%M:%S %z') AS timestamp,
        CASE
                WHEN log_struct.bytes_sent = '-' THEN NULL
                ELSE CAST(log_struct.bytes_sent AS INTEGER)
        END AS bytes_sent,
        CASE
                WHEN log_struct.object_size = '-' THEN NULL
                ELSE CAST(log_struct.object_size AS INTEGER)
        END AS object_size,
        CASE
                WHEN log_struct.total_time = '-' THEN NULL
                ELSE CAST(log_struct.total_time AS INTEGER)
        END AS total_time,
        CASE
                WHEN log_struct.turn_around_time = '-' THEN NULL
                ELSE CAST(log_struct.turn_around_time AS INTEGER)
        END AS turn_around_time
FROM parsed_logs;

Copy code

Mark Roddy

Expand

Share link


Working with spatial data

Editor's note: Geospatial data is increasingly important for analytics - whether you're looking at data like store inventory, customer location or the weather. The spatial extension for DuckDB provides support for common data formats, calculations and searching within geometries.

Create a point from latitude and longitude pairsSQL

-- Install spatial extension
INSTALL spatial; 
LOAD spatial; 

-- Represent a latitude and longitude as a point
-- The Eiffel Tower in Paris, France has a 
-- latitude of 48.858935 and longitude of 2.293412
-- We can represent this location as a point
SELECT st_point(48.858935, 2.293412) AS Eiffel_Tower;

Copy code

Find the distance between two locations (in meters)SQL

-- Distance between the Eiffel Tower and the Arc de Triomphe in Paris
-- Using the EPSG spatial reference systems: 
-- EPSG:4326 geographic coordinates as latitude and longitude pairs
-- EPSG:27563 projection that covers northern France and uses meters 

SELECT
    st_point(48.858935, 2.293412) AS Eiffel_Tower, 
    st_point(48.873407, 2.295471) AS Arc_de_Triomphe,
    st_distance(
        st_transform(Eiffel_Tower, 'EPSG:4326', 'EPSG:27563'), 
        st_transform(Arc_de_Triomphe, 'EPSG:4326', 'EPSG:27563')
    ) AS Aerial_Distance_M;

Copy code

Find the country for given latitude and longitude locationSQL

-- Load the geometry outline for each country
-- Save the country name and "geom" border in table world_boundaries
CREATE OR REPLACE TABLE world_boundaries
AS
SELECT *
FROM st_read('https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/world-administrative-boundaries/exports/geojson');

-- Find the enclosing country for a given point
-- We can which country the Eiffel Tower is in 
SELECT name, region
FROM world_boundaries
WHERE ST_Within(st_point(2.293412, 48.858935) , geom);

Copy code

Simon Aubury

Expand

Share link