Query Yahoo Finance data with Scrooge extensionSQL

Editor's note: The Scrooge McDuck extension provides common aggregation functions and data scanners used for financial data. This example grabs stock quotes (or the S&P 500 in this case) on specific dates for analysis in DuckDB.

Execute this SQL

-- Install httpfs extension
INSTALL httpfs;
LOAD httpfs;
-- Install Scrooge extension https://github.com/pdet/Scrooge-McDuck
-- NOTE: You need to start DuckDB with `-unsigned` flag to authorize to install & load of 3rd party extension
SET custom_extension_repository='scrooge-duckdb.s3.us-west-2.amazonaws.com/scrooge/s3_deploy';
INSTALL scrooge;
LOAD scrooge;
-- Example of query 
FROM yahoo_finance("^GSPC", "2023-02-01", "2023-02-04", "1d");

Copy code

Mehdi Ouazza

Expand

Share link


Unnest JSON Array into Rows (pseudo-json_each)SQL

Editor's note: as APIs and applications increasingly represent tabular data in JSON format, we sometimes want the comfort of seeing it as rows, queryable in SQL. DuckDB has a great extension for parsing JSON, including the ability to convert into a DuckDB list which can then be unnested into rows.

Execute this SQL

select 
    -- Split DuckDB list into multiple rows
    unnest(
        -- Convert JSON array into DuckDB list of JSON elements
        from_json(
            json('[{"col1":1, "col2":2}, {"col1":10, "col2":20}]'),
            '["JSON"]'
        )
    );
/* 
Returns:
{"col1":1,"col2":2}
{"col1":10,"col2":20}
*/

Copy code

Alex Monahan

Expand

Share link


Dynamic SQL in DuckDB CLISQL

Editor's note: this gets a little recursive for all those SQL nerds out there. It shows you how to use SQL to generate SQL. It also shows you some of the dot commands for interacting with the DuckDB CLI: changing the output format and writing SQL to a file which is then read in for execution.

Execute this SQL

-- Use SQL to generate SQL statements.
-- Write them to a file, then execute that file.
-- This uses DuckDB CLI-specific features, so it only works in the CLI.
.mode list
.header off
.once my_dynamic_sql.sql
select 'CREATE TABLE t' || i || ' AS SELECT ' || i || ' AS my_column;' from generate_series(5) x(i);
.read my_dynamic_sql.sql

select table_name from duckdb_tables(); -- Check our work!

Copy code

Alex Monahan

Expand

Share link


Put null values last when sorting (like Excel or Postgres)SQL

Editor's note: DuckDB enables you to configure whether NULL values are returned first or last in result sets by default. You can also specify it per query using NULLS LAST in the query ORDER BY clause. Note that NULLS LAST is now the default with 0.8.0+.

Execute this SQL

PRAGMA default_null_order='NULLS LAST';

Copy code

Matt Holden

Expand

Share link


Query S3 Access LogsSQL

Editor's note: Want to read log files with DuckDB? You can use the read_csv function and custom date/time + regex parsing to do it. To make the data more useful, you can specifically CAST some of the values as numerical types. This snippet also shows CASE WHEN ELSE statements in action.

Execute this SQL


/* 
Background: If you have S3 Access Logging enabled on one of your S3 buckets, you'll have some useful information about requests to your bucket. Unfortunately, it's in a semistructured format that can be difficult to parse. This SQL query will can help in this manner, both pulling out individual fields and coersing them to native data types.

Usage: you'll want to search for the strings <bucket> and <prefix>, and insert the S3 bucket where your access logs are being delivered. Use (or delete) <prefix> to filter to a subset of your logs.

Also, these commented out configuration settings you can either run  yourself in the REPL and source this file using `.read parse_s3_access_logs.sql`, or you can uncomment them and supply values for yourself.
*/

-- install https;
-- load https;
-- SET s3_region='us-west-2';
-- SET s3_access_key_id='';
-- SET s3_secret_access_key='';

WITH parsed_logs AS (
    SELECT
        regexp_extract(col1, '^([0-9a-zA-Z]+)\s+([a-z0-9.\-]+)\s+\[([0-9/A-Za-z: +]+)\] ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ("[^"]*"|-) ([^ ]+) ([^ ]+) (\d+|-) (\d+|-) (\d+|-) (\d+|-) ("[^"]*"|-) ("[^"]*"|-) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)(.*)$',
        ['bucket_owner', 'bucket', 'timestamp', 'remote_ip', 'request', 'request_id', 'operation', 's3_key', 'request_uri', 'http_status', 's3_errorcode', 'bytes_sent','object_size', 'total_time', 'turn_around_time', 'referer', 'user_agent', 'version_id', 'host_id', 'sigver', 'cyphersuite', 'auth_type', 'host_header', 'tls_version', 'access_point_arn', 'acl_required', 'extra']
    ) AS log_struct
    FROM
        -- Trick the CSV reader into reading as a single column
        read_csv(
            's3://<bucket>/<prefix>/*',
            columns={'col1': 'VARCHAR'},
            -- Use a *hopefully* nonsensical deliminator, so no ',' chars screw us up
            delim='\0'
        )
)
SELECT
        -- Grab everything from the struct that we want as strings, exclude stuff we'll coersce to diff types
        log_struct.* exclude (timestamp, bytes_sent, object_size, total_time, turn_around_time),
        strptime(log_struct.timestamp, '%d/%b/%Y:%H:%M:%S %z') AS timestamp,
        CASE
                WHEN log_struct.bytes_sent = '-' THEN NULL
                ELSE CAST(log_struct.bytes_sent AS INTEGER)
        END AS bytes_sent,
        CASE
                WHEN log_struct.object_size = '-' THEN NULL
                ELSE CAST(log_struct.object_size AS INTEGER)
        END AS object_size,
        CASE
                WHEN log_struct.total_time = '-' THEN NULL
                ELSE CAST(log_struct.total_time AS INTEGER)
        END AS total_time,
        CASE
                WHEN log_struct.turn_around_time = '-' THEN NULL
                ELSE CAST(log_struct.turn_around_time AS INTEGER)
        END AS turn_around_time
FROM parsed_logs;

Copy code

Mark Roddy

Expand

Share link


DuckDB in Action: Some neat DuckDB specific SQL extension

Editor's note: DuckDB strives to make it easy to write SQL, even when it requires introducing non-standard syntax. See the great blog posts by Alex Monahan or explore the Manning "DuckDB in Action" book by the author of this snippet.

Attach and select MotherDuck database

Data shared/available on MotherDuck

ATTACH 'md:_share/duckdb_in_action_ch3_4/d0c08584-1d33-491c-8db7-cf9c6910eceb' AS duckdb_in_action_ch3_4;
USE duckdb_in_action_ch3_4;

Copy code

DuckDB specific extensions: Project all columns matching a patternSQL

SELECT COLUMNS('valid.*') FROM prices LIMIT 3;

Copy code

DuckDB specific extensions: Apply an aggregation to several columnsSQL

SELECT max(COLUMNS('valid.*')) FROM prices;

Copy code

DuckDB specific extensions: Apply one condition to many columnsSQL

FROM prices
WHERE COLUMNS('valid.*') BETWEEN '2020-01-01' AND '2021-01-01';

Copy code

Michael Simons

Expand

Share link