Read an MS Excel File with the spatial extension
Editor's note: It might seem a bit odd, but the DuckDB spatial extension includes a function for reading Microsoft Excel XLSX files into DuckDB. This is because a lot of geospatial files are shared this way, but you can take advantage of this capability even if you have no spatial data!
Execute this SQL
INSTALL spatial;
LOAD spatial;
FROM st_read('file.xlsx', layer='sheet_name');
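For example, to keep the spreadsheet contents around as a regular table, a minimal follow-up might look like this (the workbook name sales.xlsx and sheet name Sheet1 are just placeholders):

INSTALL spatial;
LOAD spatial;
-- Materialize the sheet as a DuckDB table (file and sheet names are placeholders)
CREATE TABLE sales AS
    SELECT * FROM st_read('sales.xlsx', layer='Sheet1');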
Query S3 Access Logs
Editor's note: Want to read log files with DuckDB? You can use the read_csv function together with regex and date/time parsing to do it. To make the data more useful, you can CAST some of the values to numeric types. This snippet also shows CASE ... WHEN ... ELSE expressions in action.
Execute this SQL
/*
Background: If you have S3 Access Logging enabled on one of your S3 buckets,
you'll have some useful information about requests to your bucket.
Unfortunately, it's delivered in a semistructured format that can be difficult
to parse. This SQL query can help, both pulling out individual fields and
coercing them to native data types.

Usage: search for the strings <bucket> and <prefix>, and insert the S3 bucket
where your access logs are being delivered. Use (or delete) <prefix> to filter
to a subset of your logs.

The commented-out configuration settings below can either be run yourself in
the REPL (then source this file using `.read parse_s3_access_logs.sql`), or
you can uncomment them and supply your own values.
*/
-- INSTALL httpfs;
-- LOAD httpfs;
-- SET s3_region='us-west-2';
-- SET s3_access_key_id='';
-- SET s3_secret_access_key='';

WITH parsed_logs AS (
    SELECT
        regexp_extract(
            col1,
            '^([0-9a-zA-Z]+)\s+([a-z0-9.\-]+)\s+\[([0-9/A-Za-z: +]+)\] ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ("[^"]*"|-) ([^ ]+) ([^ ]+) (\d+|-) (\d+|-) (\d+|-) (\d+|-) ("[^"]*"|-) ("[^"]*"|-) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)(.*)$',
            ['bucket_owner', 'bucket', 'timestamp', 'remote_ip', 'request',
             'request_id', 'operation', 's3_key', 'request_uri', 'http_status',
             's3_errorcode', 'bytes_sent', 'object_size', 'total_time',
             'turn_around_time', 'referer', 'user_agent', 'version_id',
             'host_id', 'sigver', 'cyphersuite', 'auth_type', 'host_header',
             'tls_version', 'access_point_arn', 'acl_required', 'extra']
        ) AS log_struct
    FROM
        -- Trick the CSV reader into reading everything as a single column
        read_csv(
            's3://<bucket>/<prefix>/*',
            columns={'col1': 'VARCHAR'},
            -- Use a *hopefully* nonsensical delimiter, so no ',' chars screw us up
            delim='\0'
        )
)
SELECT
    -- Grab everything from the struct that we want as strings,
    -- excluding the fields we'll coerce to different types below
    log_struct.* EXCLUDE (timestamp, bytes_sent, object_size, total_time, turn_around_time),
    strptime(log_struct.timestamp, '%d/%b/%Y:%H:%M:%S %z') AS timestamp,
    CASE WHEN log_struct.bytes_sent = '-' THEN NULL
         ELSE CAST(log_struct.bytes_sent AS INTEGER) END AS bytes_sent,
    CASE WHEN log_struct.object_size = '-' THEN NULL
         ELSE CAST(log_struct.object_size AS INTEGER) END AS object_size,
    CASE WHEN log_struct.total_time = '-' THEN NULL
         ELSE CAST(log_struct.total_time AS INTEGER) END AS total_time,
    CASE WHEN log_struct.turn_around_time = '-' THEN NULL
         ELSE CAST(log_struct.turn_around_time AS INTEGER) END AS turn_around_time
FROM parsed_logs;
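Once parsed, the logs behave like any other table. As a minimal sketch of a follow-up query (assuming you saved the parsed output first, e.g. with CREATE TABLE access_logs AS followed by the query above):

-- Hypothetical aggregation over the parsed logs: requests and bytes per HTTP status
SELECT
    http_status,
    count(*)        AS requests,
    sum(bytes_sent) AS total_bytes_sent
FROM access_logs
GROUP BY http_status
ORDER BY requests DESC;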
Put null values last when sorting (like Excel or Postgres)
Editor's note: DuckDB lets you configure whether NULL values sort first or last by default. You can also specify it per query using NULLS LAST in the ORDER BY clause, as shown below. Note that NULLS LAST is the default as of 0.8.0.
Execute this SQL
PRAGMA default_null_order='NULLS LAST';
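To control NULL placement for a single query instead of globally, the ORDER BY clause accepts the same modifier; a minimal sketch with hypothetical table and column names:

-- Push NULL prices to the end of this result only
SELECT * FROM products ORDER BY price ASC NULLS LAST;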
Create a table macro (user-defined table function)
Execute this SQL
-- input DDL and data
CREATE TABLE products (
    name varchar,
    category varchar,
    created_at timestamp
);
INSERT INTO products VALUES
    ('Cream Sofa', 'Home and Garden', '2019-03-14'),
    ('Copper Light', 'Home and Garden', '2020-04-05'),
    ('Pink Armchair', 'Home and Garden', '2020-06-23');

-- create the table macro
CREATE OR REPLACE MACRO udf_products_in_year (v_year, v_category) AS TABLE
    SELECT name, category, created_at
    FROM products
    WHERE category = v_category
      AND year(created_at) = v_year;

-- select using the macro as you would from a table
SELECT * FROM udf_products_in_year(2020, 'Home and Garden');

-- | Copper Light  | Home and Garden | 2020-04-05 00:00:00.000 |
-- | Pink Armchair | Home and Garden | 2020-06-23 00:00:00.000 |
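Macros aren't limited to table results; the same pattern works for scalar expressions. A minimal sketch reusing the products table from above (the macro name and logic are just for illustration):

-- A scalar macro wrapping a reusable expression
CREATE OR REPLACE MACRO product_age_years(ts) AS
    year(current_date) - year(ts);

SELECT name, product_age_years(created_at) AS age_years FROM products;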
Remove duplicate rows with QUALIFY
Execute this SQL
/* removes duplicate rows at the order_id level */
SELECT *
FROM orders
QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY created_at) = 1;
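To keep the most recent row per order_id rather than the earliest, flip the sort direction inside the window; a small variation on the same hypothetical orders table:

-- Keep only the latest row for each order_id
SELECT *
FROM orders
QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY created_at DESC) = 1;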
Query Parquet data in S3
Editor's note: DuckDB users often work with files in Parquet format, which has become a standard for representing data in data lakes. While DuckDB lets you work with local Parquet files, you can also use files stored in blob storage such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
Execute this Bash
# Assuming you have the following environment variables defined:
#   AWS_ACCESS_KEY_ID
#   AWS_SECRET_ACCESS_KEY
#   AWS_DEFAULT_REGION
duckdb -c 'LOAD httpfs; SELECT count(*) FROM read_parquet("s3://<bucket>/<prefix>/*.parquet");'
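If you'd rather not rely on environment variables, the same credentials can be supplied from SQL before querying; a minimal sketch (the region, keys, and bucket path are placeholders):

INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-west-2';
SET s3_access_key_id='<key id>';
SET s3_secret_access_key='<secret key>';
SELECT count(*) FROM read_parquet('s3://<bucket>/<prefix>/*.parquet');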
SUMMARIZE
Editor's note: the SUMMARIZE command lets you quickly understand your data. If you want to know a little more about how it works under the hood, see Hamilton's other snippet on building your own SUMMARIZE capabilities using built-in analytics functions.
Execute this SQL
-- summary information about a given table
SUMMARIZE some_tbl;
-- SUMMARIZE also works directly on a file
SUMMARIZE FROM 'some_file.csv';
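SUMMARIZE also accepts an arbitrary query, so you can profile just a slice of your data; a minimal sketch with hypothetical table and column names:

-- Profile only the rows and columns you care about
SUMMARIZE SELECT price, quantity FROM orders WHERE created_at >= DATE '2023-01-01';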
Compute a metric for each numeric column and return the values in a long table (requires 0.7.2+)
Editor's note: if your data arrives with mixed types or in a format like CSV, you may want to convert values to numbers without risking an error on oddly-typed values. You can do that with TRY_CAST(), which attempts a CAST but returns NULL when the conversion isn't possible.
Execute this SQL
with computed as (
    select sum(try_cast(columns(*) as double))
    from read_csv_auto('aapl.csv')
)
select
    -- restore original column names
    trim(list_element(regexp_extract_all(name, '\.(.*?) AS', 1), 1), '"') as name,
    value
from (pivot_longer computed on *)
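As a quick illustration of the TRY_CAST behaviour the snippet relies on (the literal values are just for demonstration):

-- CAST would raise an error on 'n/a'; TRY_CAST returns NULL instead
SELECT try_cast('42' AS DOUBLE) AS parsed, try_cast('n/a' AS DOUBLE) AS not_numeric;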