Export/import/share DuckDB UI Notebooks
The DuckDB UI stores notebook content in an internal database called _duckdb_ui. You can query and export notebook content, as well as insert new definitions into the database. Warning: Modifying the internal database may lead to corruption and data loss. Be cautious and use it at your own risk!
Export a notebook definition to JSON (SQL)
copy (
    select "json"
    from _duckdb_ui.notebook_versions
    where 1=1
      and title = 'MySingleNotebook'
      and expires is null
) to 'exported-notebook.json';
Import notebook definition (SQL)
set variable notebook_content = (select json from 'exported-notebook.json');
set variable notebook_id = uuid();
set variable current_timestamp = now();

begin transaction;

insert into _duckdb_ui.notebooks (id, name, created)
select
    getvariable('notebook_id'),
    'notebook_' || getvariable('notebook_id'),
    getvariable('current_timestamp')
;

insert into _duckdb_ui.notebook_versions (notebook_id, version, title, json, created, expires)
select
    getvariable('notebook_id'),
    1,
    'imported-notebook-' || getvariable('current_timestamp'),
    getvariable('notebook_content'),
    getvariable('current_timestamp'),
    null
;

commit;
Detect Schema Changes Across Datasets (Python)
Compare the schema of two datasets and identify any differences.
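import duckdb

def compare_schemas(file1, file2):
    """
    Compare the schemas of two datasets and report the differences.

    Args:
        file1 (str): Path to the first dataset (CSV/Parquet).
        file2 (str): Path to the second dataset (CSV/Parquet).

    Returns:
        dict: Columns found in only one file, and columns whose types differ.
    """
    con = duckdb.connect()

    def describe(path):
        # Querying the path directly lets DuckDB pick the reader from the file extension.
        rows = con.execute(f"DESCRIBE SELECT * FROM '{path}'").fetchall()
        # Each DESCRIBE row starts with (column_name, column_type, ...).
        return {row[0]: row[1] for row in rows}

    schema1, schema2 = describe(file1), describe(file2)
    return {
        "only_in_file1": sorted(schema1.keys() - schema2.keys()),
        "only_in_file2": sorted(schema2.keys() - schema1.keys()),
        "type_mismatches": {
            col: (schema1[col], schema2[col])
            for col in schema1.keys() & schema2.keys()
            if schema1[col] != schema2[col]
        },
    }

# Example Usage
differences = compare_schemas("data1.csv", "data2.csv")
print(differences)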
Remove Duplicate Records from a CSV File (Bash)
This function helps clean up a dataset by identifying and removing duplicate records. It’s especially useful for ensuring data integrity before analysis.
#!/bin/bash

function remove_duplicates() {
    input_file="$1"   # Input CSV file with duplicates
    output_file="$2"  # Deduplicated output CSV file

    # Use DuckDB to remove duplicate rows and write the cleaned data to a new CSV file.
    duckdb -c "COPY (SELECT DISTINCT * FROM read_csv_auto('$input_file')) TO '$output_file' (FORMAT CSV, HEADER TRUE);"
}

# Usage
remove_duplicates "input_data.csv" "cleaned_data.csv"
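Where the DuckDB CLI is not available, roughly the same cleanup can be sketched with Python's standard csv module. This is not the DuckDB approach from the script above, and it differs in two ways worth noting: it compares rows as raw strings (DuckDB compares typed values), and it keeps the first occurrence of each row in its original position, whereas SELECT DISTINCT makes no ordering guarantee. It also assumes the first row is a header.

```python
import csv

def remove_duplicates(input_file, output_file):
    """Copy input_file to output_file, keeping only the first copy of each data row.

    Returns the number of data rows written (header excluded).
    """
    seen = set()
    kept = 0
    with open(input_file, newline="", encoding="utf-8") as src, \
         open(output_file, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        # Assumption: the first row is a header and is always copied through.
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)
        for row in reader:
            key = tuple(row)  # rows are compared as raw strings, not typed values
            if key not in seen:
                seen.add(key)
                writer.writerow(row)
                kept += 1
    return kept
```

Calling remove_duplicates("input_data.csv", "cleaned_data.csv") mirrors the usage line of the Bash function.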
Query JSON files Using SQL in Python
DuckDB supports querying JSON files directly, enabling seamless analysis of semi-structured data. This script lets you apply SQL queries to JSON files within a Python environment, ideal for preprocessing or exploring JSON datasets.
import duckdb

def query_json(file_path, query):
    """
    Query JSON data directly using DuckDB.

    Args:
        file_path (str): Path to the JSON file.
        query (str): SQL filter condition, inserted into the WHERE clause as-is.

    Returns:
        pandas.DataFrame: Query results as a Pandas DataFrame.
    """
    con = duckdb.connect()
    # Filter the JSON file with the given condition and fetch the results as a Pandas DataFrame.
    # Both arguments are interpolated into the SQL text, so only pass trusted values.
    df = con.execute(f"SELECT * FROM read_json_auto('{file_path}') WHERE {query}").df()
    return df

# Example Usage
result = query_json("./json/query_20min.json", "scheduled = true")
print(result)
read_dsv() -> Parse properly separated CSV files
I tend to prefer using the ASCII unit separator (\x1f) and record separator (\x1e) as column and line delimiters, respectively, in CSVs (which technically no longer makes them CSVs). The read_csv function doesn't seem to want to play nice with these, so here's my attempt at a workaround.
Macro definition (SQL)
-- For more info on DSVs (I'm not the author): https://matthodges.com/posts/2024-08-12-csv-bad-dsv-good/
CREATE OR REPLACE MACRO read_dsv(path_spec) AS TABLE (
    with _lines as (
        select
            filename
            ,regexp_split_to_array(content, '\x1e') as content
        from read_text(path_spec)
    )
    , _cleanup as (
        select
            filename
            ,regexp_split_to_array(content[1],'\x1f') as header
            ,generate_subscripts(content[2:],1) as linenum
            ,unnest((content[2:]).list_filter(x -> trim(x) != '').list_transform(x -> x.regexp_split_to_array('\x1f'))) as line
        from _lines
    )
    select
        filename
        ,linenum
        ,unnest(map_entries(map(header, line)), recursive := true) as kv
    from _cleanup
);
Usage (SQL)
-- You can use the same path specification as you would with read_text or read_csv; this includes globbing.
-- Including the pivot statement in the macro isn't possible, because you would then have to explicitly define the column values (which defeats the purpose of this implementation).
pivot read_dsv('C:\Temp\csv\*.csv')
on key
using first(value)
group by filename, linenum
order by filename, linenum
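To try the macro you need a file in this format first. Here's a minimal Python sketch, standard library only, that writes one and parses it back the same way the macro does (split on \x1e for records, then on \x1f for fields, with the first record as the header). The helper names write_dsv, parse_dsv and demo are made up for this illustration, and the sketch assumes every record has the same number of fields as the header.

```python
import os
import tempfile

UNIT_SEP = "\x1f"    # separates fields within a record
RECORD_SEP = "\x1e"  # separates records

def write_dsv(path, header, rows):
    """Write a header and rows as a delimiter-separated file."""
    records = [header, *rows]
    text = RECORD_SEP.join(UNIT_SEP.join(str(v) for v in rec) for rec in records)
    # newline="" keeps embedded line breaks exactly as given.
    with open(path, "w", encoding="utf-8", newline="") as fh:
        fh.write(text)

def parse_dsv(path):
    """Parse the file back into a list of dicts keyed by the header."""
    with open(path, encoding="utf-8", newline="") as fh:
        # Skip records that are empty or contain only a line break.
        records = [r for r in fh.read().split(RECORD_SEP) if r.strip("\r\n")]
    header = records[0].split(UNIT_SEP)
    return [dict(zip(header, rec.split(UNIT_SEP))) for rec in records[1:]]

def demo():
    """Round-trip values that would need quoting in a regular CSV."""
    path = os.path.join(tempfile.mkdtemp(), "sample.dsv")
    write_dsv(path, ["id", "note"], [[1, "contains, a comma"], [2, "spans\ntwo lines"]])
    return parse_dsv(path)
```

Commas and line breaks inside values need no quoting or escaping here, which is the main appeal of the format. As with the macro, all values come back as strings.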
Read Apache Iceberg to Google Sheets (SQL)
Sometimes you just need to get an Apache Iceberg table into Google Sheets for further analysis. The 'gsheet_id' can be found in the URL of your Google Sheet; the query writes to the sheet with gid=0.
-- get iceberg extension
INSTALL iceberg;
LOAD iceberg;

-- get gsheets extension
INSTALL gsheets FROM community;
LOAD gsheets;

-- authenticate to google sheets
CREATE SECRET (TYPE gsheet);

-- copy the iceberg data to your google sheet!
COPY (from iceberg_scan('s3://my-bucket/iceberg_table')) TO 'gsheet_id' (FORMAT gsheet);
Label columns based on source table (SQL)
Commonly, tables that are joined together have overlapping column names. This snippet will rename all columns to have a prefix based on the source table. No more duplicate names! This is similar to the Pandas join feature of lsuffix and rsuffix.
SELECT
    COLUMNS(t1.*) AS 't1_\0',
    COLUMNS(t2.*) AS 't2_\0'
FROM range(10) t1
JOIN range(10) t2 ON t1.range = t2.range