Detect Schema Changes Across Datasets (Python)Python

Compare the schema of two datasets and identify any differences.

Execute this Python

import duckdb

def compare_schemas(file1, file2):
    """
    Compare schemas of two datasets and find differences.
    Args:
        file1 (str): Path to the first dataset (CSV/Parquet).
        file2 (str): Path to the second dataset (CSV/Parquet).
    Returns:
        list: Schema differences.
    """
    con = duckdb.connect()
    schema1 = con.execute(f"DESCRIBE SELECT * FROM read_csv_auto('{file1}')").fetchall()
    schema2 = con.execute(f"DESCRIBE SELECT * FROM read_csv_auto('{file2}')").fetchall()
    return {"file1_schema": schema1, "file2_schema": schema2}

# Example Usage
differences = compare_schemas("data1.csv", "data2.csv")
print(differences)

Copy code

Expand

Share link


Remove Duplicate Records from a CSV File (Bash)Bash

This function helps clean up a dataset by identifying and removing duplicate records. It’s especially useful for ensuring data integrity before analysis.

Execute this Bash

#!/bin/bash
function remove_duplicates() {
    input_file="$1"  # Input CSV file with duplicates
    output_file="$2" # Deduplicated output CSV file
    # Use DuckDB to remove duplicate rows and write the cleaned data to a new CSV file.
    duckdb -c "COPY (SELECT DISTINCT * FROM read_csv_auto('$input_file')) TO '$output_file' (FORMAT CSV, HEADER TRUE);"
}

#Usage remove_duplicates "input_data.csv" "cleaned_data.csv"

Copy code

Expand

Share link


Query JSON files Using SQL in PythonPython

DuckDB supports querying JSON files directly, enabling seamless analysis of semi-structured data. This script lets you apply SQL queries to JSON files within a Python environment, ideal for preprocessing or exploring JSON datasets.

Execute this Python

import duckdb

def query_json(file_path, query):
    """
    Query JSON data directly using DuckDB.
    Args:
        file_path (str): Path to the JSON file.
        query (str): SQL query to execute on the JSON data.
    Returns:
        pandas.DataFrame: Query results as a Pandas DataFrame.
    """
    con = duckdb.connect()
    # Execute the query on the JSON file and fetch the results as a Pandas DataFrame.
    df = con.execute(f"SELECT * FROM read_json_auto('{file_path}') WHERE {query}").df()
    return df

# Example Usage
result = query_json("./json/query_20min.json", "scheduled = true")
print(result)

Copy code

Expand

Share link

Execute this SQL

--Below SQL to generate MD5 of a string

select md5('test')

Copy code

Expand

Share link