Read an MS Excel File with the spatial extensionSQL

Editor's note: It might seem a bit odd, but the DuckDB spatial extension includes a function for reading Microsoft Excel XLSX files into DuckDB. This is because a lot of geospatial files are shared this way, but you can take advantage of this capability even if you have no spatial data!

Execute this SQL

install spatial;
load spatial;
from st_read('file.xlsx',layer='sheet_name');

Copy code

Expand

Share link


Query S3 Access LogsSQL

Editor's note: Want to read log files with DuckDB? You can use the read_csv function and custom date/time + regex parsing to do it. To make the data more useful, you can specifically CAST some of the values as numerical types. This snippet also shows CASE WHEN ELSE statements in action.

Execute this SQL


/* 
Background: If you have S3 Access Logging enabled on one of your S3 buckets, you'll have some useful information about requests to your bucket. Unfortunately, it's in a semistructured format that can be difficult to parse. This SQL query will can help in this manner, both pulling out individual fields and coersing them to native data types.

Usage: you'll want to search for the strings <bucket> and <prefix>, and insert the S3 bucket where your access logs are being delivered. Use (or delete) <prefix> to filter to a subset of your logs.

Also, these commented out configuration settings you can either run  yourself in the REPL and source this file using `.read parse_s3_access_logs.sql`, or you can uncomment them and supply values for yourself.
*/

-- install https;
-- load https;
-- SET s3_region='us-west-2';
-- SET s3_access_key_id='';
-- SET s3_secret_access_key='';

WITH parsed_logs AS (
    SELECT
        regexp_extract(col1, '^([0-9a-zA-Z]+)\s+([a-z0-9.\-]+)\s+\[([0-9/A-Za-z: +]+)\] ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ("[^"]*"|-) ([^ ]+) ([^ ]+) (\d+|-) (\d+|-) (\d+|-) (\d+|-) ("[^"]*"|-) ("[^"]*"|-) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)(.*)$',
        ['bucket_owner', 'bucket', 'timestamp', 'remote_ip', 'request', 'request_id', 'operation', 's3_key', 'request_uri', 'http_status', 's3_errorcode', 'bytes_sent','object_size', 'total_time', 'turn_around_time', 'referer', 'user_agent', 'version_id', 'host_id', 'sigver', 'cyphersuite', 'auth_type', 'host_header', 'tls_version', 'access_point_arn', 'acl_required', 'extra']
    ) AS log_struct
    FROM
        -- Trick the CSV reader into reading as a single column
        read_csv(
            's3://<bucket>/<prefix>/*',
            columns={'col1': 'VARCHAR'},
            -- Use a *hopefully* nonsensical deliminator, so no ',' chars screw us up
            delim='\0'
        )
)
SELECT
        -- Grab everything from the struct that we want as strings, exclude stuff we'll coersce to diff types
        log_struct.* exclude (timestamp, bytes_sent, object_size, total_time, turn_around_time),
        strptime(log_struct.timestamp, '%d/%b/%Y:%H:%M:%S %z') AS timestamp,
        CASE
                WHEN log_struct.bytes_sent = '-' THEN NULL
                ELSE CAST(log_struct.bytes_sent AS INTEGER)
        END AS bytes_sent,
        CASE
                WHEN log_struct.object_size = '-' THEN NULL
                ELSE CAST(log_struct.object_size AS INTEGER)
        END AS object_size,
        CASE
                WHEN log_struct.total_time = '-' THEN NULL
                ELSE CAST(log_struct.total_time AS INTEGER)
        END AS total_time,
        CASE
                WHEN log_struct.turn_around_time = '-' THEN NULL
                ELSE CAST(log_struct.turn_around_time AS INTEGER)
        END AS turn_around_time
FROM parsed_logs;

Copy code

Mark Roddy

Expand

Share link


Put null values last when sorting (like Excel or Postgres)SQL

Editor's note: DuckDB enables you to configure whether NULL values are returned first or last in result sets by default. You can also specify it per query using NULLS LAST in the query ORDER BY clause. Note that NULLS LAST is now the default with 0.8.0+.

Execute this SQL

PRAGMA default_null_order='NULLS LAST';

Copy code

Matt Holden

Expand

Share link

Execute this SQL

-- create macro
CREATE OR REPLACE MACRO udf_products_in_year (v_year, v_category)
AS TABLE
SELECT 
	name, 
	category, 
	created_at
FROM products 
WHERE category = v_category
AND year(created_at) = v_year;

-- select using the macro as you would do from a table
SELECT *
FROM udf_products_in_year (2020, 'Home and Garden');

| Copper Light	| Home and Garden	| 2020-04-05 00:00:00.000 |
| Pink Armchair	| Home and Garden	| 2020-06-23 00:00:00.000 |

-- input ddl and data
CREATE TABLE products 
(
	name varchar,
	category varchar,
	created_at timestamp
);

INSERT INTO products
VALUES
('Cream Sofa', 'Home and Garden', '2019-03-14'),
('Copper Light', 'Home and Garden', '2020-04-05'),
('Pink Armchair', 'Home and Garden', '2020-06-23');

Copy code

Octavian Zarzu

Copy code

Expand

Share link

Execute this SQL

/* removes duplicate rows at the order_id level */
SELECT * FROM orders
QUALIFY row_number() over (partition by order_id order by created_at) = 1

Copy code

Octavian Zarzu

Copy code

Expand

Share link


Query Parquet data in S3Bash

Editor's note: DuckDB users often work with files in Parquet format, which has become a standard for representing data in data lakes. While DuckDB lets you work with local Parquet files, you can also use files stored in blob storage such as Amazon AWS S3, Azure Blob Storage and Google Cloud Storage.

Execute this Bash

# Assuming you have the following environment variables defined:
# AWS_ACCESS_KEY_ID
# AWS_SECRET_ACCESS_KEY
# AWS_DEFAULT_REGION
duckdb -c 'LOAD httpfs; SELECT count(*) FROM read_parquet("s3://<bucket>/<prefix>/*.parquet");'

Copy code

Damon

Expand

Share link


SUMMARIZESQL

Editor's note: the SUMMARIZE() function allows you to quickly understand your data. If you want to understand a little more about how it works under the hood, see Hamilton's other snippet on building your own SUMMARIZE() capabilities using built-in analytics functions.

Execute this SQL

SUMMARIZE some_tbl;
SUMMARIZE from 'some_file.csv';
--- summary informations about a given table

Copy code

Carlo Piovesan

Expand

Share link


Compute a metric for each numeric column and return the values in a long table (requires 0.7.2+)SQL

Editor's note: if your data originates as different types or in a format like CSV, you might want to do so without risking throwing an error for oddly-typed values. You can do so with TRY_CAST(), which will attempt a CAST but return NULL if not possible.

Execute this SQL

with computed as (
    select sum(try_cast(columns(*) as double))
    from read_csv_auto('aapl.csv')
)

select
    -- restore original column names
    trim(list_element(regexp_extract_all(name,'\.(.*?) AS',1),1),'"') as name,
    value
from (pivot_longer computed on *)

Copy code

Ben Ayre

Expand

Share link