Kernel Density Estimation - Epanechnikov Kernel

KDE estimates the probability density function of a random variable. The bandwidth parameter controls the width of the kernel and thus how smooth or detailed the estimated density curve is: a smaller bandwidth yields a more detailed (but noisier) estimate, while a larger bandwidth produces a smoother curve.

Execute this SQL

CREATE OR REPLACE MACRO KDE_EPANECH(data, varnum, bandwidth, bin_count := 30) AS TABLE
WITH hist AS (
    FROM histogram_values(data, varnum, bin_count := bin_count)
)
SELECT hist.bin, k.kernel_value
FROM hist, LATERAL (
    -- Epanechnikov kernel: 0.75 * (1 - u^2) / h for |u| < 1, with u = (bin - x) / h;
    -- averaged over all observations of varnum and scaled by 100
    SELECT 100 * AVG(
        IF(abs(hist.bin - varnum) / bandwidth < 1,
           0.75 * (1 - POW(abs(hist.bin - varnum) / bandwidth, 2)) / bandwidth,
           0)) AS kernel_value
    FROM query_table(data)
) k
ORDER BY hist.bin;

-- Following David Scott's rule, here is an estimate for bandwidth
-- (note: DuckDB's / is float division, so -1/5 evaluates to -0.2):
CREATE OR REPLACE MACRO KDE_BANDWIDTH(data, varnum) AS (
    FROM query_table(data)
    SELECT 1.06 * stddev(varnum) * pow(count(*), -1/5)
);

-- Usage
SET VARIABLE bandwidth = (SELECT KDE_BANDWIDTH(mydata, myvarnum));

FROM KDE_EPANECH(mydata, myvarnum, getvariable('bandwidth'));

-- Inspiration and illustration: https://observablehq.com/@d3/kernel-density-estimation
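
To make the bandwidth effect visible, here is a sketch that runs the macro at half and at double the estimated bandwidth (assuming the same hypothetical mydata table and myvarnum column as in the usage above):

-- half the bandwidth: more detail, noisier curve
FROM KDE_EPANECH(mydata, myvarnum, getvariable('bandwidth') / 2);

-- double the bandwidth: smoother, flatter curve
FROM KDE_EPANECH(mydata, myvarnum, getvariable('bandwidth') * 2);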

Éric Mauvière

SQL with Pipes

Pipes in SQL, via the psql community extension created by Yannick Welsch.

Execute this SQL

install psql from community;
load psql;

from 'https://sampledata.sidequery.ai/earthquakes.parquet' |>
limit 10000 |>
where status = 'Reviewed' |>
select
    data_type, 
    avg(depth), 
    avg(magnitudo)
group by all;
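
For reference, here is a sketch of the same query in regular DuckDB SQL. Because the pipe applies limit before where, an equivalent rewrite needs a subquery:

select
    data_type,
    avg(depth),
    avg(magnitudo)
from (
    from 'https://sampledata.sidequery.ai/earthquakes.parquet'
    limit 10000
)
where status = 'Reviewed'
group by all;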

Nico Ritschel

Split strings into version numbers and order properly

This snippet takes version numbers that might contain arbitrary additional information and splits them into lists of integers that sort the way `sort -V` does.

Execute this SQL

SELECT v
FROM (VALUES ('1.10.0'), ('1.3.0'), ('1.13.0.RELEASE')) f(v)
ORDER BY list_transform(string_split(v, '.'), x -> TRY_CAST(x AS INTEGER)) ASC;
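
To inspect the sort key this builds, a quick sketch: TRY_CAST turns non-numeric segments such as 'RELEASE' into NULL instead of raising a cast error, so mixed segments still produce a comparable list:

SELECT v,
       list_transform(string_split(v, '.'), x -> TRY_CAST(x AS INTEGER)) AS sort_key
FROM (VALUES ('1.10.0'), ('1.13.0.RELEASE')) f(v);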

Michael Simons

Generate series of numbers in DuckDB

DuckDB has two common ways to generate a series of numbers: the range() function and the generate_series() function. They differ only in that the generate_series() function has a 'stop' value that's inclusive, while the 'stop' value of range() is exclusive.

generate_series with inclusive stop value

-- generate_series(start, stop, step)
-- get all even numbers, starting at 0 up to and including 100
SELECT * FROM generate_series(0, 100, 2);

range with exclusive stop value

-- range(start, stop, step)
-- get all even numbers, starting at 0 up to and including 98 (100 is excluded)
SELECT * FROM range(0, 100, 2);

Generate range() as array

-- Using range() as a column value instead of a table
-- in your SQL statement will return an array of the
-- numbers in the range
SELECT range(0, 100, 2);
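
The array form can be expanded back into rows with unnest(), which is handy when the range is built inside a larger expression (a quick sketch):

-- one row per element of the array
SELECT unnest(range(0, 100, 2)) AS n;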

Ryan Boyd

Select specific CSV columns into a Parquet file

Execute this SQL

-- This selects only specific fields from the CSV
-- It also renames one of the fields in the target schema
COPY (SELECT col1, col2, col3 AS new_col_name
      FROM read_csv('~/data/source.csv', AUTO_DETECT=TRUE))
  TO '~/data/target.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');
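
To sanity-check the export, a sketch that reads the Parquet file back (same hypothetical paths as above):

-- inspect the resulting schema, including the renamed column
DESCRIBE SELECT * FROM '~/data/target.parquet';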

Robin Moffatt

Working with Parquet files

Execute this SQL

-- listing files
FROM glob('dataset/*');

-- reading from files
FROM 'dataset/*.parquet' LIMIT 100; 

-- reading parquet files metadata (min_value, max_value, null_count for each field in each file)
FROM parquet_metadata('dataset/*.parquet');

-- convert files or export tables to parquet
COPY (FROM tbl) TO 'file.parquet' (FORMAT 'parquet');
COPY (FROM 'data.csv') TO 'data.parquet' (FORMAT 'parquet');
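
For per-file row counts there is also parquet_file_metadata(); the column names below are an assumption based on recent DuckDB releases, so verify them against yours:

-- one row per file; num_rows is the file's total row count
SELECT file_name, num_rows
FROM parquet_file_metadata('dataset/*.parquet');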

Octavian Zarzu