Kernel Density Estimation - Epanechnikov KernelSQL
KDE estimates the probability distribution of a random variable. The bandwidth parameter controls the width of the kernel, influencing how smooth or detailed the estimated density curve is. A smaller bandwidth results in a more detailed estimation, while a larger bandwidth produces a smoother curve.
Execute this SQL
CREATE OR REPLACE MACRO KDE_EPANECH(data, varnum, bandwidth, bin_count := 30) AS TABLE WITH hist AS ( FROM histogram_values(data, varnum, bin_count := bin_count) ) SELECT hist.bin, k.kernel_value FROM hist, LATERAL ( SELECT 100 * AVG( IF(abs(hist.bin - varnum) / bandwidth < 1, 0.75 * (1 - POW(abs(hist.bin - varnum) / bandwidth, 2)) / bandwidth, 0)) AS kernel_value FROM query_table(data) ) k ORDER BY hist.bin ; -- Following David Scott's rule, here is an estimate for bandwidth: CREATE OR REPLACE MACRO KDE_BANDWIDTH(data, varnum) AS ( FROM query_table(data) SELECT 1.06 * stddev(varnum) * pow(count(*), -1/5) ); -- Usage SET VARIABLE bandwidth = (SELECT KDE_BANDWIDTH(mydata, myvarnum)) ; FROM KDE_EPANECH(mydata, myvarnum, getvariable('bandwidth')) ; -- Inspiration and illustration: https://observablehq.com/@d3/kernel-density-estimation
Copy code
Expand
Share link
SQL with PipesSQL
Pipes in SQL via psql extension created by Yannick Welsch
Execute this SQL
install psql from community; load psql; from 'https://sampledata.sidequery.ai/earthquakes.parquet' |> limit 10000 |> where status = 'Reviewed' |> select data_type, avg(depth), avg(magnitudo) group by all;
Copy code
Expand
Share link
Split strings into version numbers and order properSQL
This snippet takes version numbers that might contain arbitrary additional information and splits them into a list of integers, that one can sort like `sort -V` does.
Execute this SQL
SELECT v FROM VALUES ('1.10.0'), ('1.3.0'), ('1.13.0.RELEASE') f(v) ORDER BY list_transform(string_split(v, '.'), x -> TRY_CAST (x AS INTEGER)) ASC;
Copy code
Expand
Share link
Generate series of numbers in DuckDB
DuckDB has two common ways to generate a series of numbers: the range() function and the generate_series() function. They differ only in that the generate_series() function has a 'stop' value that's inclusive, while the 'stop' value of range() is exclusive.
generate_series with inclusive stop valueSQL
// generate_series(start, stop, step) // get all even numbers, starting at 0 up to and including 100 SELECT * FROM generate_series(0,100,2);
Copy code
range with exclusive stop valueSQL
// range(start, stop, step) // get all even numbers, starting at 0 up to and including 98 SELECT * FROM range(0,100,2);
Copy code
Generate range() as arraySQL
// Using range() as a column value instead of a table // in your SQL statement will return an array of the // numbers in the range SELECT range(0,100,2)
Copy code
Expand
Share link
Execute this SQL
-- This selects only specific fields from the CSV -- It also renames one of the fields in the target schema COPY (SELECT col1, col2, col3 AS new_col_name FROM read_csv('~/data/source.csv',AUTO_DETECT=TRUE)) TO '~/data/target.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');
Copy code
Copy code
Expand
Share link
Execute this SQL
-- listing files FROM glob('dataset/*'); -- reading from files FROM 'dataset/*.parquet' LIMIT 100; -- reading parquet files metadata (min_value, max_value, null_count for each field in each file) FROM parquet_metadata('dataset/*.parquet'); -- convert files or export tables to parquet COPY (FROM tbl) TO 'file.parquet' (FORMAT 'parquet'); COPY 'data.csv' TO 'data.paruqet' (FORMAT 'parquet');
Copy code
Copy code
Expand
Share link