Updating CSV files in-place

DuckDB allows updating CSV files in-place.

Execute this SQL

COPY (FROM VALUES ('foo', 10/9), ('bar', 50/7), ('qux', 9/4) t(s, x))
TO 'example.csv';

COPY (SELECT s FROM 'example.csv') TO 'example.csv';
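
Reading the file back afterwards is a quick way to confirm the rewrite; only the s column should remain:

-- inspect the rewritten file
FROM 'example.csv';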


Gábor Szárnyas



Query from Google Sheets

Sometimes you have data in Google Sheets and you want to load it into DuckDB quickly and easily. With read_csv and Google Sheets' 'export to CSV' URL, it's just one line of code.

Execute this SQL

-- You will need to create a share link for your sheet.
-- Note the tab id (gid) and the sheet id (both appear in the URL).

from read_csv('https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={tab_id}')
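
If you want to keep querying the data locally, one option (a sketch; the table name sheet_data is just illustrative) is to materialize the sheet into a table:

CREATE TABLE sheet_data AS
    FROM read_csv('https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={tab_id}');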


Jacob Matson



Kernel Density Estimation - Epanechnikov Kernel

Kernel density estimation (KDE) estimates the probability density of a random variable. The bandwidth parameter controls the width of the kernel and therefore how smooth or detailed the estimated density curve is: a smaller bandwidth gives a more detailed estimate, while a larger bandwidth produces a smoother curve.
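
For reference, the Epanechnikov kernel is K(u) = 0.75 * (1 - u^2) for |u| < 1 and 0 otherwise; in the macro below, u is the bandwidth-scaled distance between a histogram bin and each data point, which is exactly the 0.75 * (1 - POW(..., 2)) expression.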

Execute this SQL

CREATE OR REPLACE MACRO KDE_EPANECH(data, varnum, bandwidth, bin_count := 30) AS TABLE 
WITH hist AS (
	FROM histogram_values(data, varnum, bin_count := bin_count)
)
SELECT hist.bin, k.kernel_value
FROM hist, LATERAL (
    SELECT 100 * AVG(
    IF(abs(hist.bin - varnum) / bandwidth  < 1,
    0.75 * (1 - POW(abs(hist.bin - varnum) / bandwidth, 2)) / bandwidth,
    0)) AS kernel_value
	FROM query_table(data)
) k
ORDER BY hist.bin ;

-- Following David Scott's rule, here is an estimate for bandwidth:
CREATE OR REPLACE MACRO KDE_BANDWIDTH(data, varnum) AS (
	FROM query_table(data)
	SELECT 1.06 * stddev(varnum) * pow(count(*), -1/5)
);

-- Usage
SET VARIABLE bandwidth = (SELECT KDE_BANDWIDTH(mydata, myvarnum)) ;

FROM KDE_EPANECH(mydata, myvarnum, getvariable('bandwidth')) ;

-- Inspiration and illustration: https://observablehq.com/@d3/kernel-density-estimation


Éric Mauvière



SQL with Pipes

Pipe syntax in SQL via the psql community extension, created by Yannick Welsch.

Execute this SQL

install psql from community;
load psql;

from 'https://sampledata.sidequery.ai/earthquakes.parquet' |>
limit 10000 |>
where status = 'Reviewed' |>
select
    data_type, 
    avg(depth), 
    avg(magnitudo)
group by all;
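
For comparison, here is roughly the same query in plain DuckDB SQL (a sketch; the pipe version applies the limit before the filter, so a subquery mirrors that order):

SELECT
    data_type,
    avg(depth),
    avg(magnitudo)
FROM (
    SELECT *
    FROM 'https://sampledata.sidequery.ai/earthquakes.parquet'
    LIMIT 10000
)
WHERE status = 'Reviewed'
GROUP BY ALL;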


Nico Ritschel



Split strings into version numbers and order them properly

This snippet takes version numbers that might contain arbitrary additional information and splits them into lists of integers that sort the way `sort -V` does.

Execute this SQL

SELECT v
FROM VALUES ('1.10.0'), ('1.3.0'), ('1.13.0.RELEASE') f(v)
ORDER BY list_transform(string_split(v, '.'), x -> TRY_CAST(x AS INTEGER)) ASC;
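
To see the sort key this produces for a single value (non-numeric parts become NULL thanks to TRY_CAST):

SELECT list_transform(string_split('1.13.0.RELEASE', '.'), x -> TRY_CAST(x AS INTEGER));
-- result: [1, 13, 0, NULL]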


Michael Simons



Generate series of numbers in DuckDB

DuckDB has two common ways to generate a series of numbers: the range() function and the generate_series() function. They differ only in that the generate_series() function has a 'stop' value that's inclusive, while the 'stop' value of range() is exclusive.

generate_series with inclusive stop value

-- generate_series(start, stop, step)
-- get all even numbers, starting at 0 up to and including 100
SELECT * FROM generate_series(0,100,2);


range with exclusive stop value

-- range(start, stop, step)
-- get all even numbers, starting at 0 up to and including 98
SELECT * FROM range(0,100,2);


Generate range() as array

-- Using range() as a column value instead of a table
-- in your SQL statement will return an array of the
-- numbers in the range
SELECT range(0,100,2);
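
A quick way to see the inclusive vs. exclusive difference side by side (a small check, not part of the original snippets):

SELECT generate_series(0, 4) AS inclusive, range(0, 4) AS exclusive;
-- inclusive: [0, 1, 2, 3, 4], exclusive: [0, 1, 2, 3]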


Ryan Boyd


Calculate the length of a GPX track with the spatial extension

Execute this SQL

INSTALL spatial;
LOAD spatial;

-- Pick for example any Garmin Connect or Strava export as GPX file
-- https://en.wikipedia.org/wiki/GPS_Exchange_Format

SELECT round(
         -- Compute the length of the geometry in the units of the
         -- reference system being used
         ST_Length(
           -- Transform from standard GPS WGS-84 into a reference system 
           -- that uses meters. EPSG:25832 stands for ETRS89 / UTM zone 32N,
           -- Europe between 6°E and 12°E
           ST_transform(
             -- Fix the order of coordinates
             -- (GPX is read as long/lat, we need lat/long)
             ST_FlipCoordinates(
               -- Turn the WKT into a DuckDB Geometry
               ST_GeomFromWKB(wkb_geometry)
             ),
            'EPSG:4326',
            'EPSG:25832'
           )
       ) /1000, 2) AS 'Distance (km)'
FROM st_read('activity_11501782487.gpx', layer='tracks');


Michael Simons


List, read, and convert Parquet files

Execute this SQL

-- listing files
FROM glob('dataset/*');

-- reading from files
FROM 'dataset/*.parquet' LIMIT 100; 

-- reading parquet files metadata (min_value, max_value, null_count for each field in each file)
FROM parquet_metadata('dataset/*.parquet');

-- convert files or export tables to parquet
COPY (FROM tbl) TO 'file.parquet' (FORMAT 'parquet');
COPY (FROM 'data.csv') TO 'data.parquet' (FORMAT 'parquet');


Octavian Zarzu
