Mark Roddy | DuckDB Snippets

Back to Code Snippets

Mark Roddy

@mark-1

‌

Query S3 Access LogsSQL

Editor's note: Want to read log files with DuckDB? You can use the read_csv function and custom date/time + regex parsing to do it. To make the data more useful, you can specifically CAST some of the values as numerical types. This snippet also shows CASE WHEN ELSE statements in action.

Execute this SQL


/* 
Background: If you have S3 Access Logging enabled on one of your S3 buckets, you'll have some useful information about requests to your bucket. Unfortunately, it's in a semistructured format that can be difficult to parse. This SQL query will can help in this manner, both pulling out individual fields and coersing them to native data types.

Usage: you'll want to search for the strings <bucket> and <prefix>, and insert the S3 bucket where your access logs are being delivered. Use (or delete) <prefix> to filter to a subset of your logs.

Also, these commented out configuration settings you can either run  yourself in the REPL and source this file using `.read parse_s3_access_logs.sql`, or you can uncomment them and supply values for yourself.
*/

-- install https;
-- load https;
-- SET s3_region='us-west-2';
-- SET s3_access_key_id='';
-- SET s3_secret_access_key='';

WITH parsed_logs AS (
    SELECT
        regexp_extract(col1, '^([0-9a-zA-Z]+)\s+([a-z0-9.\-]+)\s+\[([0-9/A-Za-z: +]+)\] ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ("[^"]*"|-) ([^ ]+) ([^ ]+) (\d+|-) (\d+|-) (\d+|-) (\d+|-) ("[^"]*"|-) ("[^"]*"|-) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)(.*)$',
        ['bucket_owner', 'bucket', 'timestamp', 'remote_ip', 'request', 'request_id', 'operation', 's3_key', 'request_uri', 'http_status', 's3_errorcode', 'bytes_sent','object_size', 'total_time', 'turn_around_time', 'referer', 'user_agent', 'version_id', 'host_id', 'sigver', 'cyphersuite', 'auth_type', 'host_header', 'tls_version', 'access_point_arn', 'acl_required', 'extra']
    ) AS log_struct
    FROM
        -- Trick the CSV reader into reading as a single column
        read_csv(
            's3://<bucket>/<prefix>/*',
            columns={'col1': 'VARCHAR'},
            -- Use a *hopefully* nonsensical deliminator, so no ',' chars screw us up
            delim='\0'
        )
)
SELECT
        -- Grab everything from the struct that we want as strings, exclude stuff we'll coersce to diff types
        log_struct.* exclude (timestamp, bytes_sent, object_size, total_time, turn_around_time),
        strptime(log_struct.timestamp, '%d/%b/%Y:%H:%M:%S %z') AS timestamp,
        CASE
                WHEN log_struct.bytes_sent = '-' THEN NULL
                ELSE CAST(log_struct.bytes_sent AS INTEGER)
        END AS bytes_sent,
        CASE
                WHEN log_struct.object_size = '-' THEN NULL
                ELSE CAST(log_struct.object_size AS INTEGER)
        END AS object_size,
        CASE
                WHEN log_struct.total_time = '-' THEN NULL
                ELSE CAST(log_struct.total_time AS INTEGER)
        END AS total_time,
        CASE
                WHEN log_struct.turn_around_time = '-' THEN NULL
                ELSE CAST(log_struct.turn_around_time AS INTEGER)
        END AS turn_around_time
FROM parsed_logs;
Copy code

Mark Roddy

Edited 02/23/24

‌

Parse a File in an Unsupported FormatSQL

Editor's note: as data engineers, we're often burdened with data that's not in a standard format. Using DuckDB's basic string functions, advanced regex functions, list functions and the CSV parser, you can parse data of arbitrary formats.

Execute this SQL

/*
  This query pulls the contents of a file into a a single row/column, 
  and then parses it in the select statement. As an example, we're 
  parsing /proc/{pid}/environ files which are record separated by the 
  null char (\0) and column separated by the '=' char.
*/
SELECT
  str_split(filename, '/')[3] AS PID,   
  env_pair[1] AS Name,   
  env_pair[2] AS Value 
FROM (    
  SELECT
    filename,
    str_split(unnest(str_split(column0, chr(0))), '=') AS env_pair
    FROM read_csv(
           '/proc/[0-9]*/environ',
           header=False,
           filename=true,
           -- Make sure your 'delim' is a string not in the file
           delim='\0',
           columns={column0: 'VARCHAR'})
)
WHERE Name != ''
ORDER BY PID, Name
Copy code

Mark Roddy

Edited 02/23/24

‌

Query the Output of Another ProcessBash

Editor's note: if you're executing command-line interfaces that output JSON, CSV or other common formats, DuckDB enables you to do ad hoc queries on the results using SQL.

Execute this Bash

# Only using JSON as an example format, not required.
cat some-file.json | duckdb -s "SELECT * FROM read_json_auto('/dev/stdin')";
Copy code

Mark Roddy

Edited 02/23/24