Detect Schema Changes Across Datasets (Python)

Compare the schema of two datasets and identify any differences.

Execute this Python

import duckdb

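def compare_schemas(file1, file2):
    """
    Compare schemas of two datasets and find differences.
    Args:
        file1 (str): Path to the first dataset (CSV/Parquet).
        file2 (str): Path to the second dataset (CSV/Parquet).
    Returns:
        dict: Columns only in file1, columns only in file2, and
            columns present in both whose types differ.
    """
    con = duckdb.connect()

    def load_schema(path):
        # Pick the reader from the file extension; anything that is not Parquet is read as CSV.
        reader = "read_parquet" if path.lower().endswith(".parquet") else "read_csv_auto"
        rows = con.execute(f"DESCRIBE SELECT * FROM {reader}('{path}')").fetchall()
        # DESCRIBE returns the column name and type as the first two fields of each row.
        return {row[0]: row[1] for row in rows}

    schema1, schema2 = load_schema(file1), load_schema(file2)
    return {
        "only_in_file1": sorted(schema1.keys() - schema2.keys()),
        "only_in_file2": sorted(schema2.keys() - schema1.keys()),
        "type_changes": {
            col: (schema1[col], schema2[col])
            for col in schema1.keys() & schema2.keys()
            if schema1[col] != schema2[col]
        },
    }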

# Example Usage
differences = compare_schemas("data1.csv", "data2.csv")
print(differences)


Remove Duplicate Records from a CSV File (Bash)

This function cleans up a dataset by removing exact duplicate rows and writing the result to a new CSV file. It’s especially useful for ensuring data integrity before analysis.

Execute this Bash

#!/bin/bash
function remove_duplicates() {
    input_file="$1"  # Input CSV file with duplicates
    output_file="$2" # Deduplicated output CSV file
    # Use DuckDB to remove duplicate rows and write the cleaned data to a new CSV file.
    duckdb -c "COPY (SELECT DISTINCT * FROM read_csv_auto('$input_file')) TO '$output_file' (FORMAT CSV, HEADER TRUE);"
}

# Usage
remove_duplicates "input_data.csv" "cleaned_data.csv"
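
For readers who want to see what the deduplication step amounts to, here is a minimal Python sketch that uses only the standard library instead of DuckDB. It is an illustration under stated assumptions, not an equivalent of the command above: it compares rows as raw text (DuckDB compares typed values after inferring column types), it assumes the first row is a header, and it keeps rows in their original order, which SELECT DISTINCT does not promise.

```python
import csv

def remove_duplicates(input_file, output_file):
    """Copy input_file to output_file, keeping the first occurrence of each row."""
    seen = set()
    with open(input_file, newline="") as src, open(output_file, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return  # empty input: nothing to write
        writer.writerow(header)  # assumption: the first row is a header
        for row in reader:
            key = tuple(row)  # rows are compared as raw text, field by field
            if key not in seen:
                seen.add(key)
                writer.writerow(row)
```
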


Query JSON files Using SQL in Python

DuckDB can query JSON files directly, so semi-structured data can be analyzed without a separate loading step. This script applies a SQL filter condition to a JSON file from Python and returns the matching rows as a Pandas DataFrame, which is useful for preprocessing or exploring JSON datasets.

Execute this Python

import duckdb

def query_json(file_path, query):
    """
    Query JSON data directly using DuckDB.
    Args:
        file_path (str): Path to the JSON file.
        query (str): SQL filter condition placed after WHERE (e.g. "scheduled = true").
    Returns:
        pandas.DataFrame: Query results as a Pandas DataFrame.
    """
    con = duckdb.connect()
    # Execute the query on the JSON file and fetch the results as a Pandas DataFrame.
    df = con.execute(f"SELECT * FROM read_json_auto('{file_path}') WHERE {query}").df()
    return df

# Example Usage
result = query_json("./json/query_20min.json", "scheduled = true")
print(result)


read_dsv() -> Parse properly separated CSV files

I tend to prefer using the ASCII unit separator (\x1f) and record separator (\x1e) as the column and line delimiters, respectively, in CSVs (which technically means they are no longer CSVs). The read_csv function doesn't seem to want to play nice with these, so here's my attempt at a workaround.

Macro definition (SQL)

-- For more info on DSVs (I'm not the author): https://matthodges.com/posts/2024-08-12-csv-bad-dsv-good/
CREATE OR REPLACE MACRO read_dsv(path_spec)
 AS TABLE
(
with _lines as (
    select 
        filename
        ,regexp_split_to_array(content, '\x1e') as content
    from read_text(path_spec )
)
, _cleanup as (
    select 
        filename
        ,regexp_split_to_array(content[1],'\x1f') as header
        ,generate_subscripts(content[2:],1) as linenum
        ,unnest((content[2:]).list_filter(x -> trim(x) != '').list_transform(x -> x.regexp_split_to_array('\x1f'))) as line
    from _lines
)
select
    filename
    ,linenum
    ,unnest(map_entries(map(header, line)), recursive := true) as kv
from _cleanup
);

Usage (SQL)

-- You can use the same path specification as you would with read_text or read_csv, this includes globbing.
-- Trying to include the pivot statement in the macro isn't possible, as you then have to explicitly define the column values (which defeats the purpose of this implementation)
pivot read_dsv('C:\Temp\csv\*.csv')
on key
using first(value)
group by filename, linenum
order by filename, linenum
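
To make the file layout concrete, here is a small Python sketch of the same delimiter scheme: records split on \x1e, fields split on \x1f, with the first record treated as the header. The names parse_dsv, UNIT_SEP, and RECORD_SEP are made up for this illustration and are not part of DuckDB or of the macro above.

```python
UNIT_SEP = "\x1f"    # ASCII unit separator: splits fields within a record
RECORD_SEP = "\x1e"  # ASCII record separator: splits records

def parse_dsv(text):
    """Parse DSV text into a list of dicts keyed by the header record."""
    # Drop records that are empty or only spaces/newlines, e.g. after a trailing separator.
    records = [r for r in text.split(RECORD_SEP) if r.strip(" \r\n")]
    if not records:
        return []
    header = records[0].split(UNIT_SEP)
    return [dict(zip(header, rec.split(UNIT_SEP))) for rec in records[1:]]

# Commas and newlines inside a field need no quoting or escaping.
sample = RECORD_SEP.join([
    UNIT_SEP.join(["id", "comment"]),
    UNIT_SEP.join(["1", "contains, a comma"]),
    UNIT_SEP.join(["2", "contains\na newline"]),
])
rows = parse_dsv(sample)
```
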

DuckØ


Label columns based on source table (SQL)

Commonly, tables that are joined together have overlapping column names. This snippet will rename all columns to have a prefix based on the source table. No more duplicate names! This is similar to the Pandas join feature of lsuffix and rsuffix.

Execute this SQL

SELECT 
    COLUMNS(t1.*) AS 't1_\0',
    COLUMNS(t2.*) AS 't2_\0'
FROM range(10) t1 
JOIN range(10) t2 ON t1.range = t2.range

Alex Monahan


Generate series of numbers in DuckDB

DuckDB has two common ways to generate a series of numbers: the range() function and the generate_series() function. They differ only in that the generate_series() function has a 'stop' value that's inclusive, while the 'stop' value of range() is exclusive.

generate_series with inclusive stop value (SQL)

-- generate_series(start, stop, step)
-- get all even numbers, starting at 0 up to and including 100
SELECT * FROM generate_series(0,100,2);

range with exclusive stop value (SQL)

-- range(start, stop, step)
-- get all even numbers, starting at 0 up to and including 98
SELECT * FROM range(0,100,2);

Generate range() as array (SQL)

-- Using range() as a column value instead of a table
-- in your SQL statement will return an array of the
-- numbers in the range
SELECT range(0,100,2)
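
Python's built-in range() follows the same exclusive-stop convention as DuckDB's range(), so the difference is easy to check outside the database. In the sketch below, generate_series is a hypothetical Python helper written for this comparison, not a DuckDB or standard-library function.

```python
def generate_series(start, stop, step=1):
    """Inclusive-stop counterpart of range(); works for positive or negative steps."""
    # Push the stop one unit past the bound, in the direction of travel, so it is included.
    return list(range(start, stop + (1 if step > 0 else -1), step))

evens_exclusive = list(range(0, 100, 2))     # stops before 100, so the last value is 98
evens_inclusive = generate_series(0, 100, 2)  # includes 100
```
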

Ryan Boyd
