Parquet to ExcelSQL

Editor's note: although not very intuitive, format for MS Excel XLSX files is provided by the spatial extension. While other snippets show you how to read Excel files, this shows you how to write data from a named Parquet file (or any DuckDB resultset) to Excel.

Execute this SQL

INSTALL spatial;
LOAD spatial;
COPY (SELECT * FROM 'source.parquet') TO 'target.XLSX' WITH (FORMAT GDAL, DRIVER 'XLSX');

Copy code

Simon Müller

Expand

Share link


build-your-own SUMMARIZE with columns(*), unpivot, and struct.*SQL

Editor's note: Although SUMMARIZE() is fully supported in DuckDB, Hamilton shows you how to build this capability yourself using more primitive functions.

Execute this SQL

-- C = number of columns in TBL, S = number of aggregates
-- ------------------------------------------------------
-- (1) use columns(*) to produce S aggregates
-- for each column as a single struct,
-- producing 1 struct row and C columns
with aggregate as (
  from sf311 select 
    {
      name: first(alias(columns(*))),
      type: first(typeof(columns(*))),
      max: max(columns(*))::varchar,
      min: min(columns(*))::varchar,
      approx_unique: approx_count_distinct(columns(*)),
      nulls: count(*) - count(columns(*))
    }
),
-- (2) unpivot to get a result with C rows and 1 column of structs
columns as (unpivot aggregate on columns(*))
-- (3) explode the struct to get a result with C rows and S columns
select value.* from columns;

Copy code

Hamilton Ulmer

Expand

Share link

Execute this SQL

-- First, transform TBL (or any relation) 
-- into a single column of structs.
with structs (from TBL select TBL)
-- Then pack those structs into a list.
from structs select list(TBL) as ready_to_plot;

Copy code

Hamilton Ulmer

Copy code

Expand

Share link

Execute this SQL

--Below SQL to generate MD5 of a string

select md5('test')

Copy code

Expand

Share link

Execute this SQL

-- list and array_agg take in their own ORDER BY clause, so that you
-- can sort the aggregate. The statements order by cannot be used 
-- as the columns that are used for sorting then need to be a grouping
-- key and cannot be used in the aggregate

SELECT name, 
       -- Order the aggregated list by another column from line_items  
       list(item_name ORDER BY pos ASC) items 
FROM orders 
JOIN line_items ON order_id = id 
GROUP BY ALL
-- Order by grouping keys is ofc possible
ORDER BY name;

Copy code

Michael Simons

Copy code

Expand

Share link


Loading remote parquet files

Editor's note: DuckDB supports lambda functions for easy transformation. It also supports reading data over http/https [in addition from blob stores]. Simon's snippet shows you how to use both of those capabilities to dynamically generate filenames and query them when globbing is unsupported.

Load a single parquet fileSQL

-- Query a single remote parquet file 
-- Query the January data from NY taxi data set
SELECT tpep_pickup_datetime, total_amount
FROM read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet')
LIMIT 10;

Copy code

Dynamically iterate over remote parquet files - lambda function and list_transformSQL

-- Dynamically create the expected filenames to load
-- eg, the files names `2023-01.parquet` to `2023-06.parquet`
-- a lambda function can use the result of the generate_series(1, 6) 
-- and the format command can the `{:02d}` format specifier 
-- to left-pad filenames with 0's

SELECT filename, 
min(tpep_pickup_datetime), 
max(tpep_pickup_datetime), 
avg(total_amount)
FROM read_parquet(
    list_transform(
        generate_series(1, 6),
        n -> format('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-{:02d}.parquet', n) 
    ), filename=true
)
GROUP BY filename
ORDER BY filename;

Copy code

Simon Aubury

Expand

Share link


Working with spatial data

Editor's note: Geospatial data is increasingly important for analytics - whether you're looking at data like store inventory, customer location or the weather. The spatial extension for DuckDB provides support for common data formats, calculations and searching within geometries.

Create a point from latitude and longitude pairsSQL

-- Install spatial extension
INSTALL spatial; 
LOAD spatial; 

-- Represent a latitude and longitude as a point
-- The Eiffel Tower in Paris, France has a 
-- latitude of 48.858935 and longitude of 2.293412
-- We can represent this location as a point
SELECT st_point(48.858935, 2.293412) AS Eiffel_Tower;

Copy code

Find the distance between two locations (in meters)SQL

-- Distance between the Eiffel Tower and the Arc de Triomphe in Paris
-- Using the EPSG spatial reference systems: 
-- EPSG:4326 geographic coordinates as latitude and longitude pairs
-- EPSG:27563 projection that covers northern France and uses meters 

SELECT
    st_point(48.858935, 2.293412) AS Eiffel_Tower, 
    st_point(48.873407, 2.295471) AS Arc_de_Triomphe,
    st_distance(
        st_transform(Eiffel_Tower, 'EPSG:4326', 'EPSG:27563'), 
        st_transform(Arc_de_Triomphe, 'EPSG:4326', 'EPSG:27563')
    ) AS Aerial_Distance_M;

Copy code

Find the country for given latitude and longitude locationSQL

-- Load the geometry outline for each country
-- Save the country name and "geom" border in table world_boundaries
CREATE OR REPLACE TABLE world_boundaries
AS
SELECT *
FROM st_read('https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/world-administrative-boundaries/exports/geojson');

-- Find the enclosing country for a given point
-- We can which country the Eiffel Tower is in 
SELECT name, region
FROM world_boundaries
WHERE ST_Within(st_point(2.293412, 48.858935) , geom);

Copy code

Simon Aubury

Expand

Share link