Detect Schema Changes Across Datasets (Python)Python
Compare the schema of two datasets and identify any differences.
Execute this Python
import duckdb def compare_schemas(file1, file2): """ Compare schemas of two datasets and find differences. Args: file1 (str): Path to the first dataset (CSV/Parquet). file2 (str): Path to the second dataset (CSV/Parquet). Returns: list: Schema differences. """ con = duckdb.connect() schema1 = con.execute(f"DESCRIBE SELECT * FROM read_csv_auto('{file1}')").fetchall() schema2 = con.execute(f"DESCRIBE SELECT * FROM read_csv_auto('{file2}')").fetchall() return {"file1_schema": schema1, "file2_schema": schema2} # Example Usage differences = compare_schemas("data1.csv", "data2.csv") print(differences)
Copy code
Remove Duplicate Records from a CSV File (Bash)Bash
This function helps clean up a dataset by identifying and removing duplicate records. It’s especially useful for ensuring data integrity before analysis.
Execute this Bash
#!/bin/bash function remove_duplicates() { input_file="$1" # Input CSV file with duplicates output_file="$2" # Deduplicated output CSV file # Use DuckDB to remove duplicate rows and write the cleaned data to a new CSV file. duckdb -c "COPY (SELECT DISTINCT * FROM read_csv_auto('$input_file')) TO '$output_file' (FORMAT CSV, HEADER TRUE);" } #Usage remove_duplicates "input_data.csv" "cleaned_data.csv"
Copy code
Query JSON files Using SQL in PythonPython
DuckDB supports querying JSON files directly, enabling seamless analysis of semi-structured data. This script lets you apply SQL queries to JSON files within a Python environment, ideal for preprocessing or exploring JSON datasets.
Execute this Python
import duckdb def query_json(file_path, query): """ Query JSON data directly using DuckDB. Args: file_path (str): Path to the JSON file. query (str): SQL query to execute on the JSON data. Returns: pandas.DataFrame: Query results as a Pandas DataFrame. """ con = duckdb.connect() # Execute the query on the JSON file and fetch the results as a Pandas DataFrame. df = con.execute(f"SELECT * FROM read_json_auto('{file_path}') WHERE {query}").df() return df # Example Usage result = query_json("./json/query_20min.json", "scheduled = true") print(result)
Copy code
Execute this SQL
--Below SQL to generate MD5 of a string select md5('test')
Copy code