Quickly convert a CSV to Parquet, bash function (Bash)

Editor's note: DuckDB makes it easy to convert between a variety of popular data formats (CSV, JSON, Parquet, and more) using simple SQL statements. It's also easy to execute these statements from a bash shell so you have them ready to go.

Execute this Bash

#!/bin/bash
function csv_to_parquet() {
    file_path="$1"
    duckdb -c "COPY (SELECT * FROM read_csv_auto('$file_path')) TO '${file_path%.*}.parquet' (FORMAT PARQUET);"
}
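
For example (the path below is hypothetical), calling the function writes a Parquet file next to the source CSV:

# hypothetical usage: produces ./data/sales.parquet
csv_to_parquet ./data/sales.csv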

Mehdi Ouazza


Query the Output of Another Process (Bash)

Editor's note: if you're executing command-line interfaces that output JSON, CSV or other common formats, DuckDB enables you to do ad hoc queries on the results using SQL.

Execute this Bash

# Only using JSON as an example format, not required.
cat some-file.json | duckdb -s "SELECT * FROM read_json_auto('/dev/stdin')";
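
As another hedged sketch (the URL is hypothetical), any process that writes a supported format to stdout works the same way:

# pipe CSV from a remote source straight into an ad hoc query
curl -s https://example.com/data.csv | duckdb -s "SELECT count(*) FROM read_csv_auto('/dev/stdin')"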

Mark Roddy


Filter column names using a pattern (SQL)

Editor's note: DuckDB aims to make SQL even easier, while supporting standards whenever possible. When you have extremely wide tables, it's often helpful to return only columns matching a regex, and COLUMNS() does exactly that. With the EXCLUDE() and REPLACE() functions you get even more simplicity.

Execute this SQL

-- select only the columns whose names start with dim_
SELECT COLUMNS('^dim_') FROM fact_table;
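
As a hedged sketch (the column names are hypothetical), the star modifiers compose, so EXCLUDE() and REPLACE() can be applied in the same query:

-- drop the surrogate key and round a measure in place
SELECT * EXCLUDE (dim_key) REPLACE (round(amount, 2) AS amount) FROM fact_table;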

Octavian Zarzu


Using the EXCLUDE() function in DuckDB

Tired of copying/pasting many column names to select all columns except a handful? The EXCLUDE() function in SQL allows you to exclude specific columns from the result set. Together with COLUMNS() and REPLACE(), DuckDB gives you an easy way to specify exactly the data you want returned.

Select all columns except a few (SQL)

-- This will select all information about ducks
-- except their height and weight
SELECT * EXCLUDE (height, weight) FROM ducks;


Select all columns matching a regex (SQL)

-- Only return the columns matching a regular expression.
-- In this case, we're storing colors as individual columns
-- named color_1, color_2, etc. We want to return all
-- species of ducks along with their specified colors.
SELECT species, COLUMNS('^color_\d+$') FROM ducks;

-- Alternatively, we could store the colors in a
-- column named colors which is a list type, allowing us
-- to have multiple color values for each species.
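
REPLACE() isn't shown above; as a hedged sketch (the unit conversion is an assumption), it swaps an expression in for an existing column without having to list the others:

-- return every column, but with weight converted in place
SELECT * REPLACE (weight / 1000 AS weight) FROM ducks;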

Ryan Boyd


SUMMARIZE (SQL)

Editor's note: the SUMMARIZE command allows you to quickly understand your data. If you want to understand a little more about how it works under the hood, see Hamilton's other snippet on building your own SUMMARIZE capabilities using built-in analytics functions.

Execute this SQL

-- summary information about a table or file
SUMMARIZE some_tbl;
SUMMARIZE FROM 'some_file.csv';
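
As a hedged sketch of what happens under the hood (the column name amount is hypothetical), similar per-column statistics can be computed with ordinary aggregate functions:

-- roughly the statistics SUMMARIZE reports for a single column
SELECT
    min(amount)                   AS min,
    max(amount)                   AS max,
    approx_count_distinct(amount) AS approx_unique,
    avg(amount)                   AS avg,
    count(amount)                 AS non_null_count,
    100.0 * (count(*) - count(amount)) / count(*) AS null_percentage
FROM some_tbl;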

Carlo Piovesan


Select a sample of rows from a table (SQL)

DuckDB allows sampling of data in your tables using several different statistical techniques, usually to increase performance. The default method, system sampling, is used in this case: it is a Bernoulli variant applied at the vector level, so each vector of rows has the specified chance of being included in the result set.

Execute this SQL

-- uses system sampling to select 10% of the people rows
SELECT * FROM people USING SAMPLE 10%;
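
As a hedged sketch, the sampling method, sample size, and seed can also be spelled out explicitly:

-- per-row Bernoulli sampling with an explicit seed for repeatability
SELECT * FROM people USING SAMPLE 10% (bernoulli, 42);

-- a fixed-size sample of exactly 100 rows
SELECT * FROM people USING SAMPLE 100 ROWS;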

Ryan Boyd


Parse a File in an Unsupported Format (SQL)

Editor's note: as data engineers, we're often burdened with data that's not in a standard format. Using DuckDB's basic string functions, advanced regex functions, list functions and the CSV parser, you can parse data of arbitrary formats.

Execute this SQL

/*
  This query pulls the contents of a file into a single row/column,
  and then parses it in the SELECT statement. As an example, we're
  parsing /proc/{pid}/environ files, which are record-separated by the
  null char (\0) and column-separated by the '=' char.
*/
SELECT
  str_split(filename, '/')[3] AS PID,
  env_pair[1] AS Name,
  env_pair[2] AS Value
FROM (
  SELECT
    filename,
    str_split(unnest(str_split(column0, chr(0))), '=') AS env_pair
  FROM read_csv(
         '/proc/[0-9]*/environ',
         header=false,
         filename=true,
         -- Make sure your 'delim' is a string not in the file
         delim='\0',
         columns={column0: 'VARCHAR'})
)
WHERE Name != ''
ORDER BY PID, Name;
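
As a related hedged sketch (the input string is a stand-in), regexp_extract() with a list of group names pulls a key/value pair into a struct in a single call:

-- extract name and value from a key=value string as a struct
SELECT regexp_extract('HOME=/root', '^([^=]+)=(.*)$', ['name', 'value']) AS pair;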

Mark Roddy