SQL, Python & More for DuckDB - Page 3


Create partitioned Parquet files from a remote CSV source (SQL)

Editor's note: DuckDB can write partitioned Parquet files, storing your data in partitions (e.g. orders for specific dates, traffic from specific IPs, etc.) under predictable file paths. Queries against cloud storage get faster because only the files for the needed partitions are retrieved.

Execute this SQL

-- Read from a remote CSV file, and write partitioned Parquet files to local target
-- Queries like this are commonly used in Data Lakes
COPY (
    SELECT cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region
    FROM read_csv_auto('https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.csv')
) TO '/tmp/ip-ranges' (FORMAT PARQUET, PARTITION_BY cloud_provider);
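
To illustrate why the partition layout helps, here is a minimal sketch of reading the result back. It assumes the COPY above has already written to /tmp/ip-ranges, and that 'AWS' is one of the values in cloud_provider (an assumption; check the generated directory names for the actual values). With Hive-style partitioning, a filter on the partition column should let DuckDB skip the other directories.

```sql
-- Assumes the COPY above already wrote its output to /tmp/ip-ranges.
-- 'AWS' is an assumed partition value, not confirmed by the snippet.
SELECT region, count(*) AS cidr_blocks
FROM read_parquet('/tmp/ip-ranges/*/*.parquet', hive_partitioning = true)
WHERE cloud_provider = 'AWS'   -- filter on the partition column
GROUP BY region
ORDER BY cidr_blocks DESC;
```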

Tobias Müller

Edited 02/23/24


DuckDB in Action: Some neat DuckDB-specific SQL extensions

Editor's note: DuckDB strives to make it easy to write SQL, even when it requires introducing non-standard syntax. See the great blog posts by Alex Monahan or explore the Manning "DuckDB in Action" book by the author of this snippet.

Attach and select MotherDuck database
Data shared/available on MotherDuck

ATTACH 'md:_share/duckdb_in_action_ch3_4/d0c08584-1d33-491c-8db7-cf9c6910eceb' AS duckdb_in_action_ch3_4;
USE duckdb_in_action_ch3_4;

DuckDB-specific extensions: Project all columns matching a pattern (SQL)

SELECT COLUMNS('valid.*') FROM prices LIMIT 3;

DuckDB-specific extensions: Apply an aggregation to several columns (SQL)

SELECT max(COLUMNS('valid.*')) FROM prices;

DuckDB-specific extensions: Apply one condition to many columns (SQL)

FROM prices
WHERE COLUMNS('valid.*') BETWEEN '2020-01-01' AND '2021-01-01';
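
When COLUMNS appears in a WHERE clause, the condition is applied to each matching column and the results are combined with AND. The sketch below spells that out on a small stand-in table. The table demo_prices and its column names valid_from and valid_until are assumptions made for illustration, not taken from the shared database.

```sql
-- Hypothetical stand-in for the prices table; names and values are assumed.
CREATE OR REPLACE TEMP TABLE demo_prices AS
SELECT * FROM (VALUES
    (1, DATE '2020-03-01', DATE '2020-09-30'),
    (2, DATE '2020-06-01', DATE '2021-06-30')
) AS t(id, valid_from, valid_until);

-- One condition applied to every column matching the pattern ...
FROM demo_prices
WHERE COLUMNS('valid.*') BETWEEN '2020-01-01' AND '2021-01-01';

-- ... is equivalent to the spelled-out conjunction:
FROM demo_prices
WHERE valid_from  BETWEEN '2020-01-01' AND '2021-01-01'
  AND valid_until BETWEEN '2020-01-01' AND '2021-01-01';
```

In both forms a row qualifies only if every matching column passes the condition, so the second demo row is filtered out because its valid_until falls outside the range.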

Michael Simons

Edited 02/23/24


Creating parameterized views with table macros (table-valued functions, TVFs) (SQL)

Execute this SQL

-- input ddl and data
CREATE TABLE products 
(
	name varchar,
	category varchar,
	created_at timestamp
);

INSERT INTO products
VALUES
('Cream Sofa', 'Home and Garden', '2019-03-14'),
('Copper Light', 'Home and Garden', '2020-04-05'),
('Pink Armchair', 'Home and Garden', '2020-06-23');

-- create macro
CREATE OR REPLACE MACRO udf_products_in_year (v_year, v_category)
AS TABLE
SELECT 
	name, 
	category, 
	created_at
FROM products 
WHERE category = v_category
AND year(created_at) = v_year;

-- select using the macro as you would do from a table
SELECT *
FROM udf_products_in_year (2020, 'Home and Garden');

-- result:
-- | Copper Light	| Home and Garden	| 2020-04-05 00:00:00.000 |
-- | Pink Armchair	| Home and Garden	| 2020-06-23 00:00:00.000 |

Octavian Zarzu

Created 04/27/23


Combine several Parquet files into one and compress with zstd (Bash)

Editor's note: Another great example of DuckDB's wide data format support, this time merging multiple Parquet files into one. Parth also kindly shows you how to compress the resulting Parquet file with the zstd codec. DuckDB also supports the gzip and snappy compression codecs.

Execute this Bash

duckdb -c "COPY (SELECT * FROM '~/Binance_Spot_Data/*.parquet') TO 'binance.parquet' (FORMAT 'PARQUET', CODEC 'zstd')"

Parth Patil

Edited 02/23/24


Put null values last when sorting, like Excel or Postgres (SQL)

Editor's note: DuckDB enables you to configure whether NULL values are returned first or last in result sets by default. You can also specify it per query using NULLS LAST in the query ORDER BY clause. Note that NULLS LAST is now the default with 0.8.0+.

Execute this SQL

PRAGMA default_null_order='NULLS LAST';
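
The per-query form mentioned in the editor's note can be sketched as follows. The inline VALUES list is only there to keep the example self-contained; no existing table is assumed.

```sql
-- Ascending sort with NULLs forced to the end, regardless of the default.
SELECT x
FROM (VALUES (2), (NULL), (1)) AS t(x)
ORDER BY x NULLS LAST;        -- 1, 2, NULL

-- The modifier is set per sort key, so it can be flipped where needed.
SELECT x
FROM (VALUES (2), (NULL), (1)) AS t(x)
ORDER BY x DESC NULLS FIRST;  -- NULL, 2, 1
```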

Matt Holden

Edited 02/23/24


Query S3 Access Logs (SQL)

Editor's note: Want to read log files with DuckDB? You can use the read_csv function and custom date/time + regex parsing to do it. To make the data more useful, you can specifically CAST some of the values as numerical types. This snippet also shows CASE WHEN ELSE statements in action.

Execute this SQL


/* 
Background: If you have S3 Access Logging enabled on one of your S3 buckets, you'll have some useful information about requests to your bucket. Unfortunately, it's in a semi-structured format that can be difficult to parse. This SQL query can help with that, both pulling out individual fields and coercing them to native data types.

Usage: Search for the strings <bucket> and <prefix>, and replace <bucket> with the S3 bucket where your access logs are being delivered. Use (or delete) <prefix> to filter to a subset of your logs.

For the commented-out configuration settings below, you can either run them yourself in the REPL and then source this file using `.read parse_s3_access_logs.sql`, or uncomment them and supply your own values.
*/

-- install httpfs;
-- load httpfs;
-- SET s3_region='us-west-2';
-- SET s3_access_key_id='';
-- SET s3_secret_access_key='';

WITH parsed_logs AS (
    SELECT
        regexp_extract(col1, '^([0-9a-zA-Z]+)\s+([a-z0-9.\-]+)\s+\[([0-9/A-Za-z: +]+)\] ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ("[^"]*"|-) ([^ ]+) ([^ ]+) (\d+|-) (\d+|-) (\d+|-) (\d+|-) ("[^"]*"|-) ("[^"]*"|-) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)(.*)$',
        ['bucket_owner', 'bucket', 'timestamp', 'remote_ip', 'request', 'request_id', 'operation', 's3_key', 'request_uri', 'http_status', 's3_errorcode', 'bytes_sent','object_size', 'total_time', 'turn_around_time', 'referer', 'user_agent', 'version_id', 'host_id', 'sigver', 'cyphersuite', 'auth_type', 'host_header', 'tls_version', 'access_point_arn', 'acl_required', 'extra']
    ) AS log_struct
    FROM
        -- Trick the CSV reader into reading as a single column
        read_csv(
            's3://<bucket>/<prefix>/*',
            columns={'col1': 'VARCHAR'},
            -- Use a *hopefully* nonsensical delimiter, so no ',' chars screw us up
            delim='\0'
        )
)
SELECT
        -- Grab everything from the struct that we want as strings, exclude stuff we'll coerce to different types
        log_struct.* exclude (timestamp, bytes_sent, object_size, total_time, turn_around_time),
        strptime(log_struct.timestamp, '%d/%b/%Y:%H:%M:%S %z') AS timestamp,
        CASE
                WHEN log_struct.bytes_sent = '-' THEN NULL
                ELSE CAST(log_struct.bytes_sent AS INTEGER)
        END AS bytes_sent,
        CASE
                WHEN log_struct.object_size = '-' THEN NULL
                ELSE CAST(log_struct.object_size AS INTEGER)
        END AS object_size,
        CASE
                WHEN log_struct.total_time = '-' THEN NULL
                ELSE CAST(log_struct.total_time AS INTEGER)
        END AS total_time,
        CASE
                WHEN log_struct.turn_around_time = '-' THEN NULL
                ELSE CAST(log_struct.turn_around_time AS INTEGER)
        END AS turn_around_time
FROM parsed_logs;
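
As a possible shorthand for the CASE expressions above, TRY_CAST returns NULL when a cast fails instead of raising an error. The sketch below uses literal values rather than real log lines. The trade-off is that TRY_CAST turns any malformed value into NULL, whereas the CASE version only does so for '-' and still fails loudly on anything else. BIGINT is used here as a cautious choice, since S3 object sizes can exceed the 32-bit INTEGER range.

```sql
-- TRY_CAST yields NULL instead of an error when the value is not numeric,
-- so the '-' placeholder becomes NULL without a CASE expression.
SELECT
    TRY_CAST('-'    AS BIGINT) AS missing_value,  -- NULL
    TRY_CAST('1024' AS BIGINT) AS bytes_sent;     -- 1024
```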

Mark Roddy

Edited 02/23/24
