SQL with Pipes

Pipes in SQL via the psql extension, created by Yannick Welsch.

install psql from community;
load psql;

from 'https://sampledata.sidequery.ai/earthquakes.parquet' |>
limit 10000 |>
where status = 'Reviewed' |>
select
    data_type, 
    avg(depth), 
    avg(magnitudo)
group by all;

Nico Ritschel


Split strings into version numbers and order them properly

This snippet takes version numbers that might contain arbitrary additional information and splits each one into a list of integers, which can then be sorted the way `sort -V` does. Parts that are not integers, such as `RELEASE`, become NULL because of `TRY_CAST`.

SELECT v FROM VALUES ('1.10.0'), ('1.3.0'), ('1.13.0.RELEASE') f(v) ORDER BY list_transform(string_split(v, '.'), x -> TRY_CAST (x AS INTEGER)) ASC;

Michael Simons
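
To make the sort key in the version snippet visible, the same expression can be selected as a column next to the original string. This is only a sketch built on the snippet above; the alias `version_parts` is a name chosen here for illustration and is not part of the original.

```sql
-- Show each version string next to the integer list it is sorted by.
-- version_parts is an illustrative alias, not from the original snippet.
SELECT
    v,
    list_transform(string_split(v, '.'), x -> TRY_CAST(x AS INTEGER)) AS version_parts
FROM (VALUES ('1.10.0'), ('1.3.0'), ('1.13.0.RELEASE')) f(v)
ORDER BY version_parts ASC;
```

Because the lists are compared element by element, the numeric parts are compared as integers rather than as text.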


Label columns based on source table

Tables that are joined together often have overlapping column names. This snippet renames every column with a prefix based on its source table, so no names collide. It is similar to the `lsuffix` and `rsuffix` parameters of the pandas `join` method.

SELECT 
    COLUMNS(t1.*) AS 't1_\0',
    COLUMNS(t2.*) AS 't2_\0'
FROM range(10) t1 
JOIN range(10) t2 ON t1.range = t2.range

Alex Monahan
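
One way to check the resulting column names without looking at the data is to wrap the same query in `DESCRIBE`. This is a sketch that assumes the column-labeling query above runs unchanged on your DuckDB version.

```sql
-- List the output column names and types of the prefixed join
-- instead of returning its rows.
DESCRIBE
SELECT
    COLUMNS(t1.*) AS 't1_\0',
    COLUMNS(t2.*) AS 't2_\0'
FROM range(10) t1
JOIN range(10) t2 ON t1.range = t2.range;
```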

-- Using Scaleway as an example, where the first 75GB is free
-- https://www.scaleway.com/en/object-storage/
-- This also works for Google Storage, Digital Ocean Spaces,
-- Wasabi and so on

-- set these like you do for AWS
SET s3_access_key_id='MY_KEY';
SET s3_secret_access_key='MY_SECRET';

-- set the endpoint and region to override the default of
-- using AWS endpoints
SET s3_endpoint='s3.nl-ams.scw.cloud';
SET s3_region='nl-ams';

SELECT
  -- format a large number with helpful commas
  -- for readability
  format('{:,}', count(*)) AS "total rows in file"
FROM 's3://<BUCKET_NAME>/<FILE_NAME>.zst.parquet';

Chris Adams

-- listing files
FROM glob('dataset/*');

-- reading from files
FROM 'dataset/*.parquet' LIMIT 100; 

-- reading parquet files metadata (min_value, max_value, null_count for each field in each file)
FROM parquet_metadata('dataset/*.parquet');

-- convert files or export tables to parquet
COPY (FROM tbl) TO 'file.parquet' (FORMAT 'parquet');
COPY 'data.csv' TO 'data.parquet' (FORMAT 'parquet');

Octavian Zarzu

-- Here we make use of list_transform and 
-- the fact that DuckDB knows lambda expressions
SELECT list_transform(split('01:08:22', ':'), x -> CAST (x AS INTEGER));

Michael Simons

-- This selects only specific fields from the CSV
-- It also renames one of the fields in the target schema
COPY (SELECT col1, col2, col3 AS new_col_name
        FROM read_csv('~/data/source.csv', AUTO_DETECT=TRUE))
  TO '~/data/target.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');

Robin Moffatt

INSTALL spatial;
LOAD spatial;

-- Pick for example any Garmin Connect or Strava export as GPX file
-- https://en.wikipedia.org/wiki/GPS_Exchange_Format

SELECT round(
         -- Compute the length of the geometry in the units of the
         -- reference system being used
         ST_Length(
           -- Transform from standard GPS WGS-84 into a reference system 
           -- that uses meters. EPSG:25832 stands for ETRS89 / UTM zone 32N,
           -- Europe between 6°E and 12°E
           ST_transform(
             -- Fix the order of coordinates
             -- (GPX is read as long/lat, we need lat/long)
             ST_FlipCoordinates(
               -- Turn the WKB into a DuckDB Geometry
               ST_GeomFromWKB(wkb_geometry)
             ),
             'EPSG:4326',
             'EPSG:25832'
           )
       ) /1000, 2) AS 'Distance (km)'
FROM st_read('activity_11501782487.gpx', layer='tracks');

Michael Simons
