Execute this SQL

-- Background: CloudTrail logs contain a significant amount of data about the actions that occur in your AWS account, but they can be challenging to work with for a couple of reasons:
--   1. Each file is a single JSON object with a list of records under a "Records" key.
--   2. The schema varies widely depending on which AWS services and actions are being used.

-- CloudTrail logs typically go to: s3://<bucket>/AWSLogs/<accountid>/CloudTrail/<region>/<year>/<month>/<day>/<accountid-region-some-unique-id>.json.gz

-- To use this query, load the httpfs extension and either set the s3_* variables below or make sure your standard AWS credential environment variables are set.

-- INSTALL httpfs;
-- LOAD httpfs;
-- SET s3_region='us-west-2';
-- SET s3_access_key_id='';
-- SET s3_secret_access_key='';
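-- Alternatively, on DuckDB 0.10+ you can use the secrets manager instead of the SET statements
-- above (a sketch with placeholder values, mirroring the variables shown here):
-- CREATE SECRET (TYPE S3, REGION 'us-west-2', KEY_ID '', SECRET '');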

-- The following query returns the first 10 logs from today.
WITH parsed_logs AS (
    SELECT UNNEST(Records, recursive := true)
    FROM read_json_auto(
        's3://<bucket>/AWSLogs/<accountid>/CloudTrail/<region>/' || strftime(now(), '%Y/%m/%d') || '/*.json.gz',
        union_by_name=true
    )
)
SELECT * FROM parsed_logs LIMIT 10;
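
-- Once the records are unnested, you can aggregate on the standard CloudTrail fields. The sketch
-- below assumes the usual record keys eventSource and eventName; adjust if your logs differ.
WITH parsed_logs AS (
    SELECT UNNEST(Records, recursive := true)
    FROM read_json_auto(
        's3://<bucket>/AWSLogs/<accountid>/CloudTrail/<region>/' || strftime(now(), '%Y/%m/%d') || '/*.json.gz',
        union_by_name=true
    )
)
SELECT eventSource, eventName, count(*) AS event_count
FROM parsed_logs
GROUP BY eventSource, eventName
ORDER BY event_count DESC
LIMIT 20;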


Damon



Query Parquet data in S3

Editor's note: DuckDB users often work with files in Parquet format, which has become a standard for representing data in data lakes. While DuckDB lets you work with local Parquet files, you can also use files stored in blob storage such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.

Execute this Bash

# Assuming you have the following environment variables defined:
# AWS_ACCESS_KEY_ID
# AWS_SECRET_ACCESS_KEY
# AWS_DEFAULT_REGION
duckdb -c "LOAD httpfs; SELECT count(*) FROM read_parquet('s3://<bucket>/<prefix>/*.parquet');"
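
# A quick variation sketch (same assumptions about credentials and the placeholder bucket/prefix):
# inspect the column names and types of the Parquet files before running heavier queries.
duckdb -c "LOAD httpfs; DESCRIBE SELECT * FROM read_parquet('s3://<bucket>/<prefix>/*.parquet');"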


Damon
