Read an MS Excel File with the spatial extension
Editor's note: It might seem a bit odd, but the DuckDB spatial extension includes a function for reading Microsoft Excel XLSX files into DuckDB. This is because a lot of geospatial files are shared this way, but you can take advantage of this capability even if you have no spatial data!
Execute this SQL
INSTALL spatial;
LOAD spatial;
FROM st_read('file.xlsx', layer='sheet_name');
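For example, to keep the spreadsheet contents around as a regular table, a minimal follow-up might look like this (the workbook name sales.xlsx and sheet name Sheet1 are just placeholders):

INSTALL spatial;
LOAD spatial;
-- Materialize the sheet as a DuckDB table (file and sheet names are placeholders)
CREATE TABLE sales AS
    SELECT * FROM st_read('sales.xlsx', layer='Sheet1');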
Query S3 Access Logs
Editor's note: Want to read log files with DuckDB? You can use the read_csv function together with regex and date/time parsing to do it. To make the data more useful, you can CAST some of the values to numeric types. This snippet also shows CASE ... WHEN ... ELSE expressions in action.
Execute this SQL
/*
Background: If you have S3 Access Logging enabled on one of your S3 buckets,
you'll have some useful information about requests to your bucket.
Unfortunately, it's delivered in a semistructured format that can be difficult
to parse. This SQL query can help, both pulling out individual fields and
coercing them to native data types.

Usage: search for the strings <bucket> and <prefix>, and insert the S3 bucket
where your access logs are being delivered. Use (or delete) <prefix> to filter
to a subset of your logs.

The commented-out configuration settings below can either be run yourself in
the REPL (then source this file using `.read parse_s3_access_logs.sql`), or
you can uncomment them and supply your own values.
*/
-- INSTALL httpfs;
-- LOAD httpfs;
-- SET s3_region='us-west-2';
-- SET s3_access_key_id='';
-- SET s3_secret_access_key='';

WITH parsed_logs AS (
    SELECT
        regexp_extract(
            col1,
            '^([0-9a-zA-Z]+)\s+([a-z0-9.\-]+)\s+\[([0-9/A-Za-z: +]+)\] ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ("[^"]*"|-) ([^ ]+) ([^ ]+) (\d+|-) (\d+|-) (\d+|-) (\d+|-) ("[^"]*"|-) ("[^"]*"|-) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)(.*)$',
            ['bucket_owner', 'bucket', 'timestamp', 'remote_ip', 'request',
             'request_id', 'operation', 's3_key', 'request_uri', 'http_status',
             's3_errorcode', 'bytes_sent', 'object_size', 'total_time',
             'turn_around_time', 'referer', 'user_agent', 'version_id',
             'host_id', 'sigver', 'cyphersuite', 'auth_type', 'host_header',
             'tls_version', 'access_point_arn', 'acl_required', 'extra']
        ) AS log_struct
    FROM
        -- Trick the CSV reader into reading everything as a single column
        read_csv(
            's3://<bucket>/<prefix>/*',
            columns={'col1': 'VARCHAR'},
            -- Use a *hopefully* nonsensical delimiter, so no ',' chars screw us up
            delim='\0'
        )
)
SELECT
    -- Grab everything from the struct that we want as strings,
    -- excluding the fields we'll coerce to different types below
    log_struct.* EXCLUDE (timestamp, bytes_sent, object_size, total_time, turn_around_time),
    strptime(log_struct.timestamp, '%d/%b/%Y:%H:%M:%S %z') AS timestamp,
    CASE WHEN log_struct.bytes_sent = '-' THEN NULL
         ELSE CAST(log_struct.bytes_sent AS INTEGER) END AS bytes_sent,
    CASE WHEN log_struct.object_size = '-' THEN NULL
         ELSE CAST(log_struct.object_size AS INTEGER) END AS object_size,
    CASE WHEN log_struct.total_time = '-' THEN NULL
         ELSE CAST(log_struct.total_time AS INTEGER) END AS total_time,
    CASE WHEN log_struct.turn_around_time = '-' THEN NULL
         ELSE CAST(log_struct.turn_around_time AS INTEGER) END AS turn_around_time
FROM parsed_logs;
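Once parsed, the logs behave like any other table. As a minimal sketch of a follow-up query (assuming you saved the parsed output first, e.g. with CREATE TABLE access_logs AS followed by the query above):

-- Hypothetical aggregation over the parsed logs: requests and bytes per HTTP status
SELECT
    http_status,
    count(*)        AS requests,
    sum(bytes_sent) AS total_bytes_sent
FROM access_logs
GROUP BY http_status
ORDER BY requests DESC;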
Put null values last when sorting (like Excel or Postgres)
Editor's note: DuckDB lets you configure whether NULL values sort first or last by default. You can also specify it per query using NULLS LAST in the ORDER BY clause, as shown below. Note that NULLS LAST is the default as of 0.8.0.
Execute this SQL
PRAGMA default_null_order='NULLS LAST';
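To control NULL placement for a single query instead of globally, the ORDER BY clause accepts the same modifier; a minimal sketch with hypothetical table and column names:

-- Push NULL prices to the end of this result only
SELECT * FROM products ORDER BY price ASC NULLS LAST;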
Create a table macro (user-defined table function)
Execute this SQL
-- input DDL and data
CREATE TABLE products (
    name varchar,
    category varchar,
    created_at timestamp
);
INSERT INTO products VALUES
    ('Cream Sofa', 'Home and Garden', '2019-03-14'),
    ('Copper Light', 'Home and Garden', '2020-04-05'),
    ('Pink Armchair', 'Home and Garden', '2020-06-23');

-- create the table macro
CREATE OR REPLACE MACRO udf_products_in_year (v_year, v_category) AS TABLE
    SELECT name, category, created_at
    FROM products
    WHERE category = v_category
      AND year(created_at) = v_year;

-- select using the macro as you would from a table
SELECT * FROM udf_products_in_year(2020, 'Home and Garden');

-- | Copper Light  | Home and Garden | 2020-04-05 00:00:00.000 |
-- | Pink Armchair | Home and Garden | 2020-06-23 00:00:00.000 |
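Macros aren't limited to table results; the same pattern works for scalar expressions. A minimal sketch reusing the products table from above (the macro name and logic are just for illustration):

-- A scalar macro wrapping a reusable expression
CREATE OR REPLACE MACRO product_age_years(ts) AS
    year(current_date) - year(ts);

SELECT name, product_age_years(created_at) AS age_years FROM products;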
Remove duplicate rows with QUALIFY
Execute this SQL
/* removes duplicate rows at the order_id level */
SELECT *
FROM orders
QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY created_at) = 1;
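To keep the most recent row per order_id rather than the earliest, flip the sort direction inside the window; a small variation on the same hypothetical orders table:

-- Keep only the latest row for each order_id
SELECT *
FROM orders
QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY created_at DESC) = 1;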
Query Parquet data in S3
Editor's note: DuckDB users often work with files in Parquet format, which has become a standard for representing data in data lakes. While DuckDB lets you work with local Parquet files, you can also use files stored in blob storage such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
Execute this Bash
# Assuming you have the following environment variables defined:
#   AWS_ACCESS_KEY_ID
#   AWS_SECRET_ACCESS_KEY
#   AWS_DEFAULT_REGION
duckdb -c 'LOAD httpfs; SELECT count(*) FROM read_parquet("s3://<bucket>/<prefix>/*.parquet");'
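If you'd rather not rely on environment variables, the same credentials can be supplied from SQL before querying; a minimal sketch (the region, keys, and bucket path are placeholders):

INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-west-2';
SET s3_access_key_id='<key id>';
SET s3_secret_access_key='<secret key>';
SELECT count(*) FROM read_parquet('s3://<bucket>/<prefix>/*.parquet');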
SUMMARIZE
Editor's note: the SUMMARIZE command lets you quickly understand your data. If you want to know a little more about how it works under the hood, see Hamilton's other snippet on building your own SUMMARIZE capabilities using built-in analytics functions.
Execute this SQL
-- summary information about a given table
SUMMARIZE some_tbl;
-- SUMMARIZE also works directly on a file
SUMMARIZE FROM 'some_file.csv';
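SUMMARIZE also accepts an arbitrary query, so you can profile just a slice of your data; a minimal sketch with hypothetical table and column names:

-- Profile only the rows and columns you care about
SUMMARIZE SELECT price, quantity FROM orders WHERE created_at >= DATE '2023-01-01';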
Compute a metric for each numeric column and return the values in a long table (requires 0.7.2+)
Editor's note: if your data arrives with mixed types or in a format like CSV, you may want to convert values to numbers without risking an error on oddly-typed values. You can do that with TRY_CAST(), which attempts a CAST but returns NULL when the conversion isn't possible.
Execute this SQL
with computed as (
    select sum(try_cast(columns(*) as double))
    from read_csv_auto('aapl.csv')
)
select
    -- restore original column names
    trim(list_element(regexp_extract_all(name, '\.(.*?) AS', 1), 1), '"') as name,
    value
from (pivot_longer computed on *)
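As a quick illustration of the TRY_CAST behaviour the snippet relies on (the literal values are just for demonstration):

-- CAST would raise an error on 'n/a'; TRY_CAST returns NULL instead
SELECT try_cast('42' AS DOUBLE) AS parsed, try_cast('n/a' AS DOUBLE) AS not_numeric;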