Back to Code Snippets
@mark-1
Query S3 Access LogsSQL
Editor's note: Want to read log files with DuckDB? You can use the read_csv function and custom date/time + regex parsing to do it. To make the data more useful, you can specifically CAST some of the values as numerical types. This snippet also shows CASE WHEN ELSE statements in action.
Execute this SQL
/* Background: If you have S3 Access Logging enabled on one of your S3 buckets, you'll have some useful information about requests to your bucket. Unfortunately, it's in a semistructured format that can be difficult to parse. This SQL query will can help in this manner, both pulling out individual fields and coersing them to native data types. Usage: you'll want to search for the strings <bucket> and <prefix>, and insert the S3 bucket where your access logs are being delivered. Use (or delete) <prefix> to filter to a subset of your logs. Also, these commented out configuration settings you can either run yourself in the REPL and source this file using `.read parse_s3_access_logs.sql`, or you can uncomment them and supply values for yourself. */ -- install https; -- load https; -- SET s3_region='us-west-2'; -- SET s3_access_key_id=''; -- SET s3_secret_access_key=''; WITH parsed_logs AS ( SELECT regexp_extract(col1, '^([0-9a-zA-Z]+)\s+([a-z0-9.\-]+)\s+\[([0-9/A-Za-z: +]+)\] ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ("[^"]*"|-) ([^ ]+) ([^ ]+) (\d+|-) (\d+|-) (\d+|-) (\d+|-) ("[^"]*"|-) ("[^"]*"|-) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)(.*)$', ['bucket_owner', 'bucket', 'timestamp', 'remote_ip', 'request', 'request_id', 'operation', 's3_key', 'request_uri', 'http_status', 's3_errorcode', 'bytes_sent','object_size', 'total_time', 'turn_around_time', 'referer', 'user_agent', 'version_id', 'host_id', 'sigver', 'cyphersuite', 'auth_type', 'host_header', 'tls_version', 'access_point_arn', 'acl_required', 'extra'] ) AS log_struct FROM -- Trick the CSV reader into reading as a single column read_csv( 's3://<bucket>/<prefix>/*', columns={'col1': 'VARCHAR'}, -- Use a *hopefully* nonsensical deliminator, so no ',' chars screw us up delim='\0' ) ) SELECT -- Grab everything from the struct that we want as strings, exclude stuff we'll coersce to diff types log_struct.* exclude (timestamp, bytes_sent, object_size, total_time, turn_around_time), strptime(log_struct.timestamp, '%d/%b/%Y:%H:%M:%S %z') AS timestamp, CASE WHEN log_struct.bytes_sent = '-' THEN NULL ELSE CAST(log_struct.bytes_sent AS INTEGER) END AS bytes_sent, CASE WHEN log_struct.object_size = '-' THEN NULL ELSE CAST(log_struct.object_size AS INTEGER) END AS object_size, CASE WHEN log_struct.total_time = '-' THEN NULL ELSE CAST(log_struct.total_time AS INTEGER) END AS total_time, CASE WHEN log_struct.turn_around_time = '-' THEN NULL ELSE CAST(log_struct.turn_around_time AS INTEGER) END AS turn_around_time FROM parsed_logs;
Copy code
Expand
Share link
Parse a File in an Unsupported FormatSQL
Editor's note: as data engineers, we're often burdened with data that's not in a standard format. Using DuckDB's basic string functions, advanced regex functions, list functions and the CSV parser, you can parse data of arbitrary formats.
Execute this SQL
/* This query pulls the contents of a file into a a single row/column, and then parses it in the select statement. As an example, we're parsing /proc/{pid}/environ files which are record separated by the null char (\0) and column separated by the '=' char. */ SELECT str_split(filename, '/')[3] AS PID, env_pair[1] AS Name, env_pair[2] AS Value FROM ( SELECT filename, str_split(unnest(str_split(column0, chr(0))), '=') AS env_pair FROM read_csv( '/proc/[0-9]*/environ', header=False, filename=true, -- Make sure your 'delim' is a string not in the file delim='\0', columns={column0: 'VARCHAR'}) ) WHERE Name != '' ORDER BY PID, Name
Copy code
Expand
Share link
Query the Output of Another ProcessBash
Editor's note: if you're executing command-line interfaces that output JSON, CSV or other common formats, DuckDB enables you to do ad hoc queries on the results using SQL.
Execute this Bash
# Only using JSON as an example format, not required. cat some-file.json | duckdb -s "SELECT * FROM read_json_auto('/dev/stdin')";
Copy code
Expand
Share link