Put null values last when sorting (like Excel or Postgres)
Editor's note: DuckDB lets you configure whether NULL values are returned first or last in result sets by default. You can also specify it per query with NULLS LAST in the ORDER BY clause. Note that NULLS LAST is the default as of DuckDB 0.8.0.
PRAGMA default_null_order='NULLS LAST';
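For a single query, the same ordering can be requested inline instead of globally. A minimal sketch, assuming a hypothetical trips table with a nullable fare_amount column:

-- Per-query alternative: sort NULLs last for this ORDER BY only
SELECT *
FROM trips
ORDER BY fare_amount NULLS LAST;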
Query S3 Access Logs
Editor's note: Want to read log files with DuckDB? You can use the read_csv function together with custom date/time and regex parsing to do it. To make the data more useful, you can CAST specific values to numerical types. This snippet also shows CASE WHEN ... ELSE expressions in action.
/* Background: If you have S3 Access Logging enabled on one of your S3
   buckets, you'll have some useful information about requests to your
   bucket. Unfortunately, it's in a semistructured format that can be
   difficult to parse. This SQL query can help, both pulling out
   individual fields and coercing them to native data types.

   Usage: search for the strings <bucket> and <prefix>, and insert the
   S3 bucket where your access logs are being delivered. Use (or delete)
   <prefix> to filter to a subset of your logs.

   The commented-out configuration settings below can either be run by
   you in the REPL before sourcing this file with
   `.read parse_s3_access_logs.sql`, or uncommented and supplied with
   your own values. */
-- INSTALL httpfs;
-- LOAD httpfs;
-- SET s3_region='us-west-2';
-- SET s3_access_key_id='';
-- SET s3_secret_access_key='';
WITH parsed_logs AS (
    SELECT
        regexp_extract(col1,
            '^([0-9a-zA-Z]+)\s+([a-z0-9.\-]+)\s+\[([0-9/A-Za-z: +]+)\] ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ("[^"]*"|-) ([^ ]+) ([^ ]+) (\d+|-) (\d+|-) (\d+|-) (\d+|-) ("[^"]*"|-) ("[^"]*"|-) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)(.*)$',
            ['bucket_owner', 'bucket', 'timestamp', 'remote_ip', 'request', 'request_id',
             'operation', 's3_key', 'request_uri', 'http_status', 's3_errorcode',
             'bytes_sent', 'object_size', 'total_time', 'turn_around_time', 'referer',
             'user_agent', 'version_id', 'host_id', 'sigver', 'cyphersuite', 'auth_type',
             'host_header', 'tls_version', 'access_point_arn', 'acl_required', 'extra']
        ) AS log_struct
    FROM
        -- Trick the CSV reader into reading each log line as a single column
        read_csv(
            's3://<bucket>/<prefix>/*',
            columns={'col1': 'VARCHAR'},
            -- Use a *hopefully* nonsensical delimiter, so no ',' chars break parsing
            delim='\0'
        )
)
SELECT
    -- Grab everything from the struct that we want as strings,
    -- excluding the fields we'll coerce to other types below
    log_struct.* EXCLUDE (timestamp, bytes_sent, object_size, total_time, turn_around_time),
    strptime(log_struct.timestamp, '%d/%b/%Y:%H:%M:%S %z') AS timestamp,
    CASE WHEN log_struct.bytes_sent = '-' THEN NULL
         ELSE CAST(log_struct.bytes_sent AS INTEGER) END AS bytes_sent,
    CASE WHEN log_struct.object_size = '-' THEN NULL
         ELSE CAST(log_struct.object_size AS INTEGER) END AS object_size,
    CASE WHEN log_struct.total_time = '-' THEN NULL
         ELSE CAST(log_struct.total_time AS INTEGER) END AS total_time,
    CASE WHEN log_struct.turn_around_time = '-' THEN NULL
         ELSE CAST(log_struct.turn_around_time AS INTEGER) END AS turn_around_time
FROM parsed_logs;
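Once the log fields are parsed and typed, ordinary analytics follow naturally. A minimal sketch, assuming the query above has been saved into a hypothetical access_logs table (e.g. via CREATE TABLE access_logs AS ...):

-- Count requests and total bytes sent per HTTP status code
SELECT http_status,
       count(*) AS requests,
       sum(bytes_sent) AS total_bytes_sent
FROM access_logs
GROUP BY http_status
ORDER BY requests DESC;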
Build-your-own SUMMARIZE with columns(*), unpivot, and struct.*
Editor's note: Although SUMMARIZE is fully supported in DuckDB, Hamilton shows you how to build the same capability yourself out of more primitive functions.
-- C = number of columns in TBL, S = number of aggregates
-- ------------------------------------------------------
-- (1) use columns(*) to produce S aggregates
--     for each column as a single struct,
--     producing 1 struct row and C columns
with aggregate as (
    from sf311 select
    {
        name: first(alias(columns(*))),
        type: first(typeof(columns(*))),
        max: max(columns(*))::varchar,
        min: min(columns(*))::varchar,
        approx_unique: approx_count_distinct(columns(*)),
        nulls: count(*) - count(columns(*))
    }
),
-- (2) unpivot to get a result with C rows and 1 column of structs
columns as (unpivot aggregate on columns(*))
-- (3) explode the struct to get a result with C rows and S columns
select value.* from columns;
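For comparison, the built-in version is a one-liner against the same table:

-- The built-in equivalent of the hand-rolled aggregation above
SUMMARIZE sf311;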
Working with spatial data
Editor's note: Geospatial data is increasingly important for analytics, whether you're looking at store inventory, customer locations, or the weather. The spatial extension for DuckDB provides support for common data formats, calculations, and searching within geometries.
Create a point from latitude and longitude pairs
-- Install the spatial extension
INSTALL spatial;
LOAD spatial;

-- Represent a latitude and longitude as a point
-- The Eiffel Tower in Paris, France has a
-- latitude of 48.858935 and longitude of 2.293412
-- We can represent this location as a point
SELECT st_point(48.858935, 2.293412) AS Eiffel_Tower;
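To sanity-check a geometry, it can be rendered as well-known text with ST_AsText. A small sketch (the output shown in the comment is indicative):

-- Render the point as human-readable WKT
SELECT ST_AsText(st_point(48.858935, 2.293412)) AS Eiffel_Tower_wkt;
-- Returns something like: POINT (48.858935 2.293412)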
Find the distance between two locations (in meters)
-- Distance between the Eiffel Tower and the Arc de Triomphe in Paris
-- Using the EPSG spatial reference systems:
--   EPSG:4326  geographic coordinates as latitude and longitude pairs
--   EPSG:27563 projection that covers northern France and uses meters
SELECT
    st_point(48.858935, 2.293412) AS Eiffel_Tower,
    st_point(48.873407, 2.295471) AS Arc_de_Triomphe,
    st_distance(
        st_transform(Eiffel_Tower, 'EPSG:4326', 'EPSG:27563'),
        st_transform(Arc_de_Triomphe, 'EPSG:4326', 'EPSG:27563')
    ) AS Aerial_Distance_M;
Find the country for a given latitude and longitude
-- Load the geometry outline for each country
-- Save the country name and "geom" border in table world_boundaries
CREATE OR REPLACE TABLE world_boundaries AS
    SELECT * FROM st_read('https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/world-administrative-boundaries/exports/geojson');

-- Find the enclosing country for a given point
-- We can check which country the Eiffel Tower is in
-- Note: GeoJSON stores coordinates as (longitude, latitude),
-- so the point is built in that order to match "geom"
SELECT name, region
FROM world_boundaries
WHERE ST_Within(st_point(2.293412, 48.858935), geom);
Working with public REST APIs
Editor's note: By combining the read_json capability of DuckDB with its ability to retrieve arbitrary HTTPS URLs, you can easily use DuckDB to read data from public, unprotected APIs. Note that you can't currently pass arbitrary HTTP headers to access protected APIs.
Query JSON from a public REST API
-- DuckDB can directly query a public REST API that returns JSON
-- Query the "tvmaze.com" API for the TV show "South Park"
SELECT name, type, summary
FROM read_json('https://api.tvmaze.com/singlesearch/shows?q=South%20Park',
               auto_detect=true, format='newline_delimited');
Format the response from a public REST API
-- Format the response from a public REST API
-- Retrieve a list of temperature readings for Sydney, Australia
-- The returned payload has hourly temps for 24 hours
-- We can use JSONPath to extract data from a JSON document
-- to find the most recent temperature
SELECT json_extract(hourly, '$.temperature_2m[1]')
FROM read_json('https://api.open-meteo.com/v1/forecast?latitude=-33.8678&longitude=151.2073&hourly=temperature_2m&forecast_days=1',
               auto_detect=true, format='newline_delimited');
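A hedged variation on the same call: Open-Meteo's hourly payload also carries a parallel time array, so the timestamp matching a reading can be pulled alongside it (the '$.time[1]' path is an assumption about the payload shape, not something shown above):

-- Pull the timestamp that corresponds to the temperature reading
-- (assumes the hourly struct contains a "time" array parallel to temperature_2m)
SELECT json_extract(hourly, '$.time[1]') AS reading_time,
       json_extract(hourly, '$.temperature_2m[1]') AS temperature
FROM read_json('https://api.open-meteo.com/v1/forecast?latitude=-33.8678&longitude=151.2073&hourly=temperature_2m&forecast_days=1',
               auto_detect=true, format='newline_delimited');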