Convert CSV to Parquet and provide schema to useBash
Editor's note: while there are other snippets showing file conversion, Parth's shows you how to convert from CSV to Parquet files using DuckDB with specification of the entire schema (columns) and compression codec.
Execute this Bash
duckdb -c "COPY (SELECT * FROM read_csv('pageviews-sanitized-20230101-000000.csv', delim=' ', header=False, columns={'domain_code': 'VARCHAR', 'page_title': 'VARCHAR', 'count_views': 'UINTEGER', 'total_response_size': 'UINTEGER'})) TO 'pageviews-sanitized-20230101-000000.parquet' (FORMAT 'PARQUET', CODEC 'zstd')"
Copy code
Expand
Share link
Compute a metric for each numeric column and return the values in a long table (requires 0.7.2+)SQL
Editor's note: if your data originates as different types or in a format like CSV, you might want to do so without risking throwing an error for oddly-typed values. You can do so with TRY_CAST(), which will attempt a CAST but return NULL if not possible.
Execute this SQL
with computed as ( select sum(try_cast(columns(*) as double)) from read_csv_auto('aapl.csv') ) select -- restore original column names trim(list_element(regexp_extract_all(name,'\.(.*?) AS',1),1),'"') as name, value from (pivot_longer computed on *)
Copy code
Expand
Share link
DuckDB in Action, Examples from chapters 3 and 4: Having fun with power production measurements
Editor's note: the Manning book "DuckDB in Action" has tons of great SQL exercises, including a few chapters which work on power production data. Michael, one of the authors, shares some of these queries here [and on other snippets published on this site].
3.17 Using arg_max to find sibling values of aggregated values computed in a common table expressionSQL
WITH per_hour AS ( SELECT system_id, date_trunc('hour', read_on) AS read_on, avg(power) / 1000 AS kWh FROM readings GROUP BY ALL ) SELECT name, max(kWh), arg_max(read_on, kWh) AS 'Read on' FROM per_hour JOIN systems s ON s.id = per_hour.system_id WHERE system_id = 34 GROUP by s.name;
Copy code
4.18 Using named windows defined with a partition and a range to compute several aggregates at onceSQL
SELECT system_id, day, min(kWh) OVER seven_days AS "7-day min", quantile(kWh, [0.25, 0.5, 0.75]) OVER seven_days AS "kWh 7-day quartile", max(kWh) OVER seven_days AS "7-day max", FROM v_power_per_day WINDOW seven_days AS ( PARTITION BY system_id, month(day) ORDER BY day ASC RANGE BETWEEN INTERVAL 3 Days PRECEDING AND INTERVAL 3 Days FOLLOWING ) ORDER BY system_id, day;
Copy code
4.23 Filtering outside the where clause to find a time period with a high production over 7 daysSQL
SELECT system_id, day, avg(kWh) OVER ( PARTITION BY system_id ORDER BY day ASC RANGE BETWEEN INTERVAL 3 Days PRECEDING AND INTERVAL 3 Days FOLLOWING ) AS "kWh 7-day moving average" FROM v_power_per_day QUALIFY "kWh 7-day moving average" > 875 ORDER BY system_id, day;
Copy code
4.26 Easily pivoting a result set (around a year)SQL
PIVOT (FROM v_power_per_day) ON year(day) USING sum(kWh);
Copy code
4.28 Using the ASOF join to pick prices that have been valid up to the point in time of sellingSQL
WITH prices AS ( SELECT range AS valid_at, random()*10 AS price FROM range( '2023-01-01 01:00:00'::timestamp, '2023-01-01 02:00:00'::timestamp, INTERVAL '15 minutes') ), sales AS ( SELECT range AS sold_at, random()*10 AS num FROM range( '2023-01-01 01:00:00'::timestamp, '2023-01-01 02:00:00'::timestamp, INTERVAL '5 minutes') ) SELECT sold_at, valid_at AS 'with_price_at', round(num * price,2) as price FROM sales ASOF JOIN prices ON prices.valid_at <= sales.sold_at;
Copy code
4.29 Using the ASOF join together with a window function to compute accumulated earningsSQL
SELECT power.day, power.kWh, prices.value as 'ct/kWh', round(sum(prices.value * power.kWh) OVER (ORDER BY power.day ASC) / 100, 2) AS 'Accumulated earnings in EUR' FROM v_power_per_day power ASOF JOIN prices ON prices.valid_from <= power.day WHERE system_id = 34 ORDER BY day;
Copy code
4.33 Using a LATERAL Join for looping through a JSON array returned by a solar radiation APISQL
-- The below query generates a series of 7 days, -- joins those with the hours 8, 13 and 19 (7pm) to create indexes. -- Those indexes are the day number * 24 plus the desired hour of the -- day to find the value in the JSON array. -- That index is the lateral driver for the sub-query: -- -- It then calls Open Meteo for Offenbach, apparently the city with -- the most sunshine in germany to get shortwave radition -- at those times WITH days AS ( SELECT generate_series AS value FROM generate_series(7) ), hours AS ( SELECT unnest([8, 13, 18]) AS value ), indexes AS ( SELECT days.value * 24 + hours.value AS i FROM days, hours ) SELECT date_trunc('day', now()) - INTERVAL '7 days' + INTERVAL (indexes.i || ' hours') AS ts, ghi.v AS 'GHI in W/m^2' FROM indexes, LATERAL ( SELECT hourly.shortwave_radiation_instant[i+1] AS v FROM read_json_auto('https://api.open-meteo.com/v1/forecast?latitude=48.47377&longitude=7.94495&hourly=shortwave_radiation_instant&past_days=7') ) AS ghi ORDER BY ts;
Copy code
Expand
Share link