Create partitioned Parquet files from a remote CSV source
Editor's note: DuckDB can create partitioned Parquet files, letting you store your data in partitions (e.g. orders for specific dates, traffic from specific IPs) under predictable filenames. This makes queries from cloud storage more performant, as only the needed files are retrieved.
Execute this SQL
-- Read from a remote CSV file, and write partitioned Parquet files to a local target
-- Queries like this are commonly used in data lakes
COPY (
    SELECT cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region
    FROM read_csv_auto('https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.csv')
) TO '/tmp/ip-ranges' (FORMAT PARQUET, PARTITION_BY cloud_provider);
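To benefit from the partitioning on the read side, filter on the partition column so DuckDB only touches the matching files. A minimal sketch, assuming the COPY above produced cloud_provider=... directories (the value 'aws' is an illustrative guess at one of the provider names in the CSV):

-- hive_partitioning exposes the directory key (cloud_provider) as a column,
-- so this scan can skip every other partition entirely
SELECT count(*)
FROM read_parquet('/tmp/ip-ranges/*/*.parquet', hive_partitioning = true)
WHERE cloud_provider = 'aws';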
DuckDB in Action: Some neat DuckDB-specific SQL extensions
Editor's note: DuckDB strives to make it easy to write SQL, even when that requires introducing non-standard syntax. See the great blog posts by Alex Monahan or explore the Manning "DuckDB in Action" book by the author of this snippet.
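The three snippets below query a prices table with date columns whose names start with "valid". A minimal sketch of such a table, so the snippets can be run as-is (all column names and values here are illustrative assumptions, chosen only so that COLUMNS('valid.*') matches something):

-- Hypothetical table for the COLUMNS('valid.*') examples below
CREATE TABLE prices (
    ticker      VARCHAR,
    price       DECIMAL(18, 2),
    valid_from  DATE,   -- assumed name, matches the 'valid.*' pattern
    valid_until DATE    -- assumed name, matches the 'valid.*' pattern
);
INSERT INTO prices VALUES
    ('ABC', 10.50, DATE '2020-03-01', DATE '2020-09-30'),
    ('ABC', 11.25, DATE '2020-10-01', DATE '2021-03-31'),
    ('XYZ',  7.80, DATE '2019-11-15', DATE '2020-06-30');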
DuckDB-specific extensions: Project all columns matching a pattern
SELECT COLUMNS('valid.*') FROM prices LIMIT 3;
DuckDB-specific extensions: Apply an aggregation to several columns
SELECT max(COLUMNS('valid.*')) FROM prices;
DuckDB-specific extensions: Apply one condition to many columns
FROM prices WHERE COLUMNS('valid.*') BETWEEN '2020-01-01' AND '2021-01-01';
Reading a fixed-width file in DuckDB
Fixed-width files can be a little difficult, but if you treat each line of data as a string that you can attack with DuckDB and substr(), it's not that hard ;)
Execute this SQL
CREATE OR REPLACE TABLE example_table AS
SELECT
    CAST(substr(line, 1, 4) AS INTEGER)      AS activity_year,
    CAST(substr(line, 5, 10) AS VARCHAR(10)) AS lei_or_respondent_id,
    CAST(substr(line, 15, 1) AS CHAR(1))     AS agency_code,
    CAST(substr(line, 16, 1) AS CHAR(1))     AS loan_type,
    CAST(substr(line, 17, 1) AS CHAR(1))     AS loan_purpose,
    CAST(substr(line, 18, 1) AS CHAR(1))     AS occupancy_type
FROM (SELECT column0 AS line FROM read_csv('data.txt', AUTO_DETECT=TRUE, skip=1));
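For reference, the substr() offsets above correspond to a record layout like this hypothetical data.txt (the first line is a header that skip=1 discards; each record is 18 characters: 4 for the year, 10 for the LEI, then four 1-character codes):

sample header row (discarded by skip=1)
2021ABCDEFGHIJ1234
2020KLMNOPQRST5678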
Create a table macro and query it like a table
Execute this SQL
-- input DDL and data
CREATE TABLE products (
    name       varchar,
    category   varchar,
    created_at timestamp
);
INSERT INTO products VALUES
    ('Cream Sofa', 'Home and Garden', '2019-03-14'),
    ('Copper Light', 'Home and Garden', '2020-04-05'),
    ('Pink Armchair', 'Home and Garden', '2020-06-23');

-- create the table macro
CREATE OR REPLACE MACRO udf_products_in_year(v_year, v_category) AS TABLE
    SELECT name, category, created_at
    FROM products
    WHERE category = v_category
      AND year(created_at) = v_year;

-- select using the macro as you would from a table
SELECT * FROM udf_products_in_year(2020, 'Home and Garden');
-- | Copper Light  | Home and Garden | 2020-04-05 00:00:00.000 |
-- | Pink Armchair | Home and Garden | 2020-06-23 00:00:00.000 |
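As a quick sanity check on the data above, calling the macro with a different year returns the remaining row:

SELECT * FROM udf_products_in_year(2019, 'Home and Garden');
-- | Cream Sofa | Home and Garden | 2019-03-14 00:00:00.000 |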
Combine several Parquet files into one and compress with zstd
Editor's note: another great example of using DuckDB's broad data format support to merge multiple Parquet files. Parth also kindly shows how to compress the resulting Parquet file with the zstd codec; DuckDB also supports the gzip and snappy compression codecs.
Execute this Bash
duckdb -c "COPY (SELECT * FROM '~/Binance_Spot_Data/*.parquet') TO 'binance.parquet' (FORMAT 'PARQUET', CODEC 'zstd')"
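Newer DuckDB releases document this Parquet COPY option under the name COMPRESSION rather than CODEC; an equivalent statement in plain SQL (same paths as above):

-- Same merge-and-compress, using the COMPRESSION spelling of the option
COPY (SELECT * FROM '~/Binance_Spot_Data/*.parquet')
TO 'binance.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);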
Put null values last when sorting (like Excel or Postgres)
Editor's note: DuckDB lets you configure whether NULL values are returned first or last in result sets by default. You can also specify it per query with NULLS LAST in the ORDER BY clause. Note that NULLS LAST is the default as of 0.8.0.
Execute this SQL
PRAGMA default_null_order='NULLS LAST';
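The per-query form mentioned in the note looks like this (the table and column names are illustrative):

-- NULLS LAST here applies only to this ORDER BY, whatever the global default is
SELECT * FROM orders ORDER BY shipped_at NULLS LAST;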