Run SQL file in DuckDB CLI
The DuckDB CLI can execute a set of SQL statements stored in a file or passed in via STDIN. A few variants of this capability are demonstrated below.
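The examples below all read from a create.sql file; its contents can be any SQL you like, e.g. (a purely hypothetical schema):

-- hypothetical contents of create.sql
CREATE TABLE people (id INTEGER, name VARCHAR);
INSERT INTO people VALUES (1, 'Alice'), (2, 'Bob');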
Read and execute SQL using the -init CLI argument and prompt for additional SQL statements (Bash)
# Executes SQL in create.sql, and then prompts for additional
# SQL statements provided interactively. Note that when specifying
# an -init flag, the ~/.duckdbrc file is not read.
duckdb -init create.sql
Read and execute SQL using the -init CLI argument and immediately exit (Bash)
# Executes SQL in create.sql and then immediately exits.
# Note that we're specifying a database name so that we
# can access the created data later, and that when specifying
# an -init flag, the ~/.duckdbrc file is not read.
duckdb -init create.sql -no-stdin mydb.ddb
Pipe SQL file to the DuckDB CLI and exit (Bash)
# Executes SQL in create.sql against the mydb.ddb database, then exits.
duckdb < create.sql mydb.ddb
Read Apache Iceberg to Google Sheets (SQL)
Sometimes you just need to get an Apache Iceberg table into Google Sheets for further analysis. The 'gsheet_id' placeholder is the ID found in the URL of your Google Sheet; the COPY writes to the sheet with gid=0.
Execute this SQL
-- get iceberg extension
INSTALL iceberg;
LOAD iceberg;

-- get gsheets extension
INSTALL gsheets FROM community;
LOAD gsheets;

-- authenticate to google sheets
CREATE SECRET (TYPE gsheet);

-- copy the iceberg data to your google sheet!
COPY (from iceberg_scan('s3://my-bucket/iceberg_table'))
    TO 'gsheet_id' (FORMAT gsheet);
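Before the data lands in the sheet, it can be worth sanity-checking the Iceberg scan on its own; a minimal preview, assuming the same table path as above:

-- preview the first rows of the iceberg table before copying
SELECT * FROM iceberg_scan('s3://my-bucket/iceberg_table') LIMIT 10;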
read_dsv() -> Parse properly separated CSV files
I tend to prefer using the ASCII unit separator (\x1f) and group separator (\x1e) as column and line delimiters, respectively, in CSVs (which technically no longer makes them CSVs). The read_csv function doesn't seem to want to play nice with these, so here's my attempt at a workaround.
Macro definition (SQL)
-- For more info on DSVs (I'm not the author):
-- https://matthodges.com/posts/2024-08-12-csv-bad-dsv-good/
CREATE OR REPLACE MACRO read_dsv(path_spec) AS TABLE (
    with _lines as (
        select
            filename
            ,regexp_split_to_array(content, '\x1e') as content
        from read_text(path_spec)
    )
    , _cleanup as (
        select
            filename
            ,regexp_split_to_array(content[1], '\x1f') as header
            ,generate_subscripts(content[2:], 1) as linenum
            ,unnest(
                (content[2:])
                    .list_filter(x -> trim(x) != '')
                    .list_transform(x -> x.regexp_split_to_array('\x1f'))
            ) as line
        from _lines
    )
    select
        filename
        ,linenum
        ,unnest(map_entries(map(header, line)), recursive := true) as kv
    from _cleanup
);
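The final select is where the header names get attached to each parsed record: map() zips the header list with the field list, map_entries() turns that into a list of key/value structs, and the recursive unnest explodes them into key and value columns. A minimal illustration on literal lists (nothing here is DSV-specific):

-- one output row per header/field pair, with key and value columns:
-- ('a', '1') and ('b', '2')
SELECT unnest(map_entries(map(['a', 'b'], ['1', '2'])), recursive := true);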
Usage (SQL)
-- You can use the same path specification as you would with read_text or
-- read_csv; this includes globbing.
-- Trying to include the pivot statement in the macro isn't possible, as you
-- then have to explicitly define the column values (which defeats the purpose
-- of this implementation).
pivot read_dsv('C:\Temp\csv\*.csv')
on key
using first(value)
group by filename, linenum
order by filename, linenum
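If you don't have a DSV file handy, DuckDB itself can fabricate a tiny one to test the macro against; a sketch, where the file name test.dsv is arbitrary and we lean on the default CSV writer leaving the separator bytes untouched:

-- chr(31) is the unit separator \x1f, chr(30) the group separator \x1e
COPY (
    SELECT concat(
        'id', chr(31), 'name',  chr(30),  -- header
        '1',  chr(31), 'Alice', chr(30),  -- record 1
        '2',  chr(31), 'Bob'              -- record 2
    ) AS dsv
) TO 'test.dsv' (HEADER false);

-- one row per field: filename, linenum, key, value
-- (caveat: the CSV writer appends a final newline, which ends up
-- trailing the last value)
from read_dsv('test.dsv');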
Query JSON Files Using SQL in Python (Python)
DuckDB supports querying JSON files directly, enabling seamless analysis of semi-structured data. This script lets you filter a JSON file with a SQL predicate from within a Python environment, ideal for preprocessing or exploring JSON datasets.
Execute this Python
import duckdb

def query_json(file_path, query):
    """
    Query JSON data directly using DuckDB.

    Args:
        file_path (str): Path to the JSON file.
        query (str): SQL predicate applied in the WHERE clause.

    Returns:
        pandas.DataFrame: Query results as a Pandas DataFrame.
    """
    con = duckdb.connect()
    # Execute the query on the JSON file and fetch the results as a Pandas DataFrame.
    # Note: file_path and query are interpolated directly into the SQL string,
    # so only pass trusted input.
    df = con.execute(f"SELECT * FROM read_json_auto('{file_path}') WHERE {query}").df()
    return df

# Example usage
result = query_json("./json/query_20min.json", "scheduled = true")
print(result)
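For comparison, the same filter runs directly in the DuckDB CLI with no Python wrapper (same hypothetical file path and predicate as above):

SELECT * FROM read_json_auto('./json/query_20min.json') WHERE scheduled = true;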
Remove Duplicate Records from a CSV File (Bash)
This function helps clean up a dataset by identifying and removing duplicate records. It’s especially useful for ensuring data integrity before analysis.
Execute this Bash
#!/bin/bash

function remove_duplicates() {
    input_file="$1"   # Input CSV file with duplicates
    output_file="$2"  # Deduplicated output CSV file

    # Use DuckDB to remove duplicate rows and write the cleaned data to a new CSV file.
    duckdb -c "COPY (SELECT DISTINCT * FROM read_csv_auto('$input_file')) TO '$output_file' (FORMAT CSV, HEADER TRUE);"
}

# Usage
remove_duplicates "input_data.csv" "cleaned_data.csv"
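Note that SELECT DISTINCT * only drops rows that are identical in every column. If duplicates should instead be decided by a key, DuckDB's DISTINCT ON is one option; a sketch, assuming a hypothetical id column:

-- keep one row per id (the first according to the ORDER BY)
COPY (
    SELECT DISTINCT ON (id) *
    FROM read_csv_auto('input_data.csv')
    ORDER BY id
) TO 'cleaned_data.csv' (FORMAT CSV, HEADER TRUE);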