Loading remote parquet files

Editor's note: DuckDB supports lambda functions for easy transformation. It also supports reading data over http/https [in addition from blob stores]. Simon's snippet shows you how to use both of those capabilities to dynamically generate filenames and query them when globbing is unsupported.

Load a single parquet fileSQL

-- Query a single remote parquet file 
-- Query the January data from NY taxi data set
SELECT tpep_pickup_datetime, total_amount
FROM read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet')
LIMIT 10;
Copy code

Dynamically iterate over remote parquet files - lambda function and list_transformSQL

-- Dynamically create the expected filenames to load
-- eg, the files names `2023-01.parquet` to `2023-06.parquet`
-- a lambda function can use the result of the generate_series(1, 6) 
-- and the format command can the `{:02d}` format specifier 
-- to left-pad filenames with 0's

SELECT filename, 
min(tpep_pickup_datetime), 
max(tpep_pickup_datetime), 
avg(total_amount)
FROM read_parquet(
    list_transform(
        generate_series(1, 6),
        n -> format('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-{:02d}.parquet', n) 
    ), filename=true
)
GROUP BY filename
ORDER BY filename;
Copy code

Simon Aubury

Edited 02/23/24