Aggregate rows into a sorted list

Execute this SQL

-- list and array_agg accept their own ORDER BY clause, so that you
-- can sort the aggregate. The statement's ORDER BY cannot be used for this,
-- because the columns used for sorting would then have to be grouping
-- keys and could no longer be referenced inside the aggregate.

SELECT name, 
       -- Order the aggregated list by another column from line_items  
       list(item_name ORDER BY pos ASC) items 
FROM orders 
JOIN line_items ON order_id = id 
GROUP BY ALL
-- Order by grouping keys is ofc possible
ORDER BY name;
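
For comparison, the comment above also mentions array_agg; the same query written with array_agg and its own ORDER BY produces the same sorted list (same orders/line_items schema assumed):

SELECT name,
       array_agg(item_name ORDER BY pos ASC) items
FROM orders
JOIN line_items ON order_id = id
GROUP BY ALL
ORDER BY name;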


Michael Simons



Allow unsigned extensions (own extensions or untrusted third-parties)

Editor's note: While DuckDB has 20+ official extensions, it also allows you to write your own extensions. In order to load extensions from unofficial repositories, you'll need to let DuckDB know that it's safe to do so. This can be done using the allow_unsigned_extensions flag or -unsigned for the CLI.

Execute this Bash

duckdb -unsigned
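
The allow_unsigned_extensions flag mentioned in the note can also be combined with a custom extension repository. A minimal sketch, where the repository URL and extension name are placeholders:

# Disable signature checks, point DuckDB at an unofficial repository,
# then install and load an extension from it
duckdb -unsigned -c "SET custom_extension_repository='https://example.com/extensions'; INSTALL my_extension; LOAD my_extension;"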


Mehdi Ouazza



Boxplot quantiles all at once

Editor's note: with a variety of supported statistical functions, including things like sampling and producing quantiles, DuckDB provides great analytical capabilities not always present in other databases.

Execute this SQL

select 
    unnest([.1,.25,.5,.75,.9]) as quantile,
    unnest(quantile_cont(i, [.1,.25,.5,.75,.9])) as value 
from range(101) t(i);

/* This returns:
| quantile | value |
|     0.10 |    10 |
|     0.25 |    25 |
|     0.50 |    50 |
|     0.75 |    75 |
|     0.90 |    90 |
*/
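
The same trick works per group as well; a sketch assuming a hypothetical measurements table with sensor_id and reading columns:

select
    sensor_id,
    unnest([.1,.25,.5,.75,.9]) as quantile,
    unnest(quantile_cont(reading, [.1,.25,.5,.75,.9])) as value
from measurements
group by sensor_id;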


Alex Monahan



Convert CSV to Parquet and provide schema to use

Editor's note: while there are other snippets showing file conversion, Parth's snippet shows how to convert CSV files to Parquet with DuckDB while specifying the entire schema (columns) and the compression codec.

Execute this Bash

duckdb -c "COPY (SELECT * FROM read_csv('pageviews-sanitized-20230101-000000.csv', delim=' ', header=False, columns={'domain_code': 'VARCHAR', 'page_title': 'VARCHAR', 'count_views': 'UINTEGER', 'total_response_size': 'UINTEGER'})) TO 'pageviews-sanitized-20230101-000000.parquet' (FORMAT 'PARQUET', CODEC 'zstd')"


Parth Patil



Compute a metric for each numeric column and return the values in a long table (requires 0.7.2+)

Editor's note: if your data originates as mixed types or in a format like CSV, you might want to cast values to numbers without risking an error on oddly-typed values. You can do so with TRY_CAST(), which attempts a CAST but returns NULL if it is not possible.

Execute this SQL

with computed as (
    select sum(try_cast(columns(*) as double))
    from read_csv_auto('aapl.csv')
)

select
    -- restore original column names
    trim(list_element(regexp_extract_all(name,'\.(.*?) AS',1),1),'"') as name,
    value
from (pivot_longer computed on *)
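
As the note explains, TRY_CAST returns NULL where a plain CAST would raise an error; a tiny standalone illustration:

select
    cast('42' as double) as strict_cast,       -- 42.0
    try_cast('n/a' as double) as lenient_cast; -- NULL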


Ben Ayre



Query Parquet data in S3

Editor's note: DuckDB users often work with files in Parquet format, which has become a standard for representing data in data lakes. While DuckDB lets you work with local Parquet files, you can also use files stored in blob storage such as Amazon S3, Azure Blob Storage and Google Cloud Storage.

Execute this Bash

# Assuming you have the following environment variables defined:
# AWS_ACCESS_KEY_ID
# AWS_SECRET_ACCESS_KEY
# AWS_DEFAULT_REGION
duckdb -c "LOAD httpfs; SELECT count(*) FROM read_parquet('s3://<bucket>/<prefix>/*.parquet');"
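
Instead of environment variables, the S3 credentials can also be provided in SQL; a sketch with placeholder values:

INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-east-1';
SET s3_access_key_id='<AWS_ACCESS_KEY_ID>';
SET s3_secret_access_key='<AWS_SECRET_ACCESS_KEY>';
SELECT count(*) FROM read_parquet('s3://<bucket>/<prefix>/*.parquet');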


Damon



Read an MS Excel File with the spatial extension

Editor's note: It might seem a bit odd, but the DuckDB spatial extension includes a function for reading Microsoft Excel XLSX files into DuckDB. This is because a lot of geospatial files are shared this way, but you can take advantage of this capability even if you have no spatial data!

Execute this SQL

install spatial;
load spatial;
from st_read('file.xlsx',layer='sheet_name');
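
To keep querying the sheet without re-reading the file, it can be materialized into a table first; a minimal sketch reusing the file and sheet name from above:

create table sheet_data as
select * from st_read('file.xlsx', layer='sheet_name');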



DuckDB in Action, Examples from chapters 3 and 4: Having fun with power production measurements

Editor's note: the Manning book "DuckDB in Action" has tons of great SQL exercises, including a few chapters which work on power production data. Michael, one of the authors, shares some of these queries here [and on other snippets published on this site].

Attach and select MotherDuck database

Data shared/available on MotherDuck

ATTACH 'md:_share/duckdb_in_action_ch3_4/d0c08584-1d33-491c-8db7-cf9c6910eceb' AS duckdb_in_action_ch3_4;
USE duckdb_in_action_ch3_4;


3.17 Using arg_max to find sibling values of aggregated values computed in a common table expression

WITH per_hour AS (
    SELECT system_id,
           date_trunc('hour', read_on) AS read_on,
           avg(power) / 1000 AS kWh
    FROM readings
    GROUP BY ALL
)
SELECT name,
       max(kWh),
       arg_max(read_on, kWh) AS 'Read on'
FROM per_hour
   JOIN systems s ON s.id = per_hour.system_id
WHERE system_id = 34
GROUP BY s.name;
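
For intuition: arg_max(x, y) returns the value of x from the row in which y is maximal, which is what pairs the maximum kWh with its timestamp above. A minimal self-contained illustration:

SELECT max(v) AS max_v,
       arg_max(label, v) AS label_at_max
FROM (VALUES ('a', 1), ('b', 3), ('c', 2)) t(label, v);
-- Returns max_v = 3, label_at_max = 'b'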


4.18 Using named windows defined with a partition and a range to compute several aggregates at once

SELECT system_id,
       day,
       min(kWh) OVER seven_days AS "7-day min",
       quantile(kWh, [0.25, 0.5, 0.75])
         OVER seven_days AS "kWh 7-day quartile",
       max(kWh) OVER seven_days AS "7-day max",
FROM v_power_per_day
WINDOW
    seven_days AS (
        PARTITION BY system_id, month(day)
        ORDER BY day ASC
        RANGE BETWEEN INTERVAL 3 Days PRECEDING
                  AND INTERVAL 3 Days FOLLOWING
    )
ORDER BY system_id, day;


4.23 Filtering outside the where clause to find a time period with high production over 7 days

SELECT system_id,
       day,
       avg(kWh) OVER (
            PARTITION BY system_id
            ORDER BY day ASC
            RANGE BETWEEN INTERVAL 3 Days PRECEDING
                      AND INTERVAL 3 Days FOLLOWING
       ) AS "kWh 7-day moving average"
FROM v_power_per_day
QUALIFY "kWh 7-day moving average" > 875 
ORDER BY system_id, day;
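
QUALIFY plays the same role for window functions that HAVING plays for aggregates. A minimal illustration that keeps only the top row per group:

SELECT grp, v
FROM (VALUES ('a', 1), ('a', 2), ('b', 5)) t(grp, v)
QUALIFY row_number() OVER (PARTITION BY grp ORDER BY v DESC) = 1;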


4.26 Easily pivoting a result set (around a year)

PIVOT (FROM v_power_per_day)
ON year(day)
USING sum(kWh);


4.28 Using the ASOF join to pick prices that were valid up to the point in time of selling

WITH prices AS (
  SELECT range AS valid_at,
         random()*10 AS price
  FROM range(
       '2023-01-01 01:00:00'::timestamp,
       '2023-01-01 02:00:00'::timestamp, INTERVAL '15 minutes')
),
sales AS (
  SELECT range AS sold_at,
         random()*10 AS num
  FROM range(
      '2023-01-01 01:00:00'::timestamp,
      '2023-01-01 02:00:00'::timestamp, INTERVAL '5 minutes')
)
SELECT sold_at, valid_at AS 'with_price_at', round(num * price,2) as price
FROM sales
ASOF JOIN prices
    ON prices.valid_at <= sales.sold_at;


4.29 Using the ASOF join together with a window function to compute accumulated earnings

SELECT power.day,
       power.kWh,
       prices.value as 'ct/kWh',
       round(sum(prices.value * power.kWh)
           OVER (ORDER BY power.day ASC) / 100, 2)
           AS 'Accumulated earnings in EUR'
FROM v_power_per_day power
    ASOF JOIN prices
    ON prices.valid_from <= power.day
WHERE system_id = 34
ORDER BY day;


4.33 Using a LATERAL join for looping through a JSON array returned by a solar radiation API

-- The query below generates a series of 7 days and
-- joins those with the hours 8, 13 and 18 (6 pm) to create indexes.
-- Each index is the day number * 24 plus the desired hour of the
-- day and is used to locate the value in the JSON array.
-- That index is the lateral driver for the sub-query:
--
-- It then calls Open-Meteo for Offenbach, apparently the city with
-- the most sunshine in Germany, to get the shortwave radiation
-- at those times.
WITH days AS (
  SELECT generate_series AS value FROM generate_series(7)
), hours AS (
  SELECT unnest([8, 13, 18]) AS value
), indexes AS (
  SELECT days.value * 24 + hours.value AS i
  FROM days, hours
)
SELECT date_trunc('day', now()) - INTERVAL '7 days' +
         INTERVAL (indexes.i || ' hours') AS ts,
       ghi.v AS 'GHI in W/m^2'
FROM indexes,
LATERAL (
  SELECT hourly.shortwave_radiation_instant[i+1] AS v
  FROM read_json_auto('https://api.open-meteo.com/v1/forecast?latitude=48.47377&longitude=7.94495&hourly=shortwave_radiation_instant&past_days=7')
) AS ghi
ORDER BY ts;

Michael Simons
