SQL, Python & More for DuckDB - Page 6

‌

Add a unique column (key) to an existing table in DuckDBSQL

We can use DuckDB's `SEQUENCE` to add a unique key to a table.Many source data files (such as CSV files) do not include a unique key.

Execute this SQL

CREATE TABLE tbl (s VARCHAR);

INSERT INTO tbl VALUES ( 'hello'),('world');

CREATE SEQUENCE id_sequence;

ALTER TABLE tbl ADD COLUMN id INTEGER DEFAULT nextval ('id_sequence');

SELECT * FROM tbl;
Copy code

colorcrow

Created 08/31/24

‌

Reading a fixed-width file in DuckDBSQL

Fixed-width files can be little difficult but IF you consider each line of data as a string which you can attack with duckdb and substr() its not that difficult ;)

Execute this SQL

CREATE OR REPLACE TABLE example_table AS
SELECT 
    CAST(substr(line, 1, 4) AS INTEGER) AS activity_year,
    CAST(substr(line, 5, 10) AS VARCHAR(10)) AS lei_or_respondent_id,
    CAST(substr(line, 15, 1) AS CHAR(1)) AS agency_code,
    CAST(substr(line, 16, 1) AS CHAR(1)) AS loan_type,
    CAST(substr(line, 17, 1) AS CHAR(1)) AS loan_purpose,
    CAST(substr(line, 18, 1) AS CHAR(1)) AS occupancy_type
FROM
    (SELECT column0 AS line FROM read_csv('data.txt', AUTO_DETECT=TRUE, skip=1));
Copy code

Chetan Amrao

Edited 10/02/24

‌

Label columns based on source tableSQL

Commonly, tables that are joined together have overlapping column names. This snippet will rename all columns to have a prefix based on the source table. No more duplicate names! This is similar to the Pandas join feature of lsuffix and rsuffix.

Execute this SQL

SELECT 
    COLUMNS(t1.*) AS 't1_\0',
    COLUMNS(t2.*) AS 't2_\0'
FROM range(10) t1 
JOIN range(10) t2 ON t1.range = t2.range
Copy code

Alex Monahan

Created 11/21/24

‌

Aggregate rows into a sorted list.SQL

Execute this SQL

-- list and array_agg take in their own ORDER BY clause, so that you
-- can sort the aggregate. The statements order by cannot be used 
-- as the columns that are used for sorting then need to be a grouping
-- key and cannot be used in the aggregate

SELECT name, 
       -- Order the aggregated list by another column from line_items  
       list(item_name ORDER BY pos ASC) items 
FROM orders 
JOIN line_items ON order_id = id 
GROUP BY ALL
-- Order by grouping keys is ofc possible
ORDER BY name;
Copy code

Michael Simons

Created 12/18/23

‌

Sort by WeekdaysSQL

Execute this SQL

-- duckdb has only weekday-name not number
-- create enum in the right order
create type Weekday as enum ('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday');

-- cast the string to enum for sorting
select count(*), cast(dayname(CreationDate) as Weekday) as day
from posts where posttypeid = 1 and tags like '%>sql<%'
group by all
order by day asc;
-- doing the cast in order by causes a bug

-- ┌──────────────┬───────────┐
-- │ count_star() │    day    │
-- │    int64     │  weekday  │
-- ├──────────────┼───────────┤
-- │       103937 │ Monday    │
-- │       115575 │ Tuesday   │
-- │       119825 │ Wednesday │
-- │       119514 │ Thursday  │
-- │       103445 │ Friday    │
-- │        47139 │ Saturday  │
-- │        47390 │ Sunday    │
-- └──────────────┴───────────┘
Copy code

Michael Hunger

Created 07/15/23

‌

Allow unsigned extensions (own extensions or untrusted third-parties)Bash

Editor's note: While DuckDB has 20+ official extensions, it also allows you to write your own extensions. In order load extensions from unofficial repositories, you'll need to let DuckDB know that it's safe to do so. This can be done using the allow_unsigned_extensions flag or -unsigned for the CLI.

Execute this Bash

duckdb -unsigned
Copy code

Mehdi Ouazza

Edited 02/23/24

‌

Boxplot quantiles all at onceSQL

Editor's note: with a variety of supported statistical functions, including things like sampling and producing quantiles, DuckDB provides great analytical capabilities not always present in other databases.

Execute this SQL

select 
    unnest([.1,.25,.5,.75,.9]) as quantile,
    unnest(quantile_cont(i, [.1,.25,.5,.75,.9])) as value 
from range(101) t(i);

/* This returns:
| quantile | value |
|     0.10 |    10 |
|     0.25 |    25 |
|     0.50 |    50 |
|     0.75 |    75 |
|     0.90 |    90 |
*/
Copy code

Alex Monahan

Edited 02/23/24

‌

Convert CSV to Parquet and provide schema to useBash

Editor's note: while there are other snippets showing file conversion, Parth's shows you how to convert from CSV to Parquet files using DuckDB with specification of the entire schema (columns) and compression codec.

Execute this Bash

duckdb -c "COPY (SELECT * FROM read_csv('pageviews-sanitized-20230101-000000.csv', delim=' ', header=False, columns={'domain_code': 'VARCHAR', 'page_title': 'VARCHAR', 'count_views': 'UINTEGER', 'total_response_size': 'UINTEGER'})) TO 'pageviews-sanitized-20230101-000000.parquet' (FORMAT 'PARQUET', CODEC 'zstd')"
Copy code

Parth Patil

Edited 02/23/24