Quickly convert a CSV to Parquet, bash functionBash
Editor's note: DuckDB makes it easy to convert between a variety of popular data formats (CSV, JSON, Parquet, and more) using simple SQL statements. It's also easy to execute these statements from a bash shell so you have them ready to go.
Execute this Bash
#!/bin/bash function csv_to_parquet() { file_path="$1" duckdb -c "COPY (SELECT * FROM read_csv_auto('$file_path')) TO '${file_path%.*}.parquet' (FORMAT PARQUET);" }
Copy code
Expand
Share link
Query the Output of Another ProcessBash
Editor's note: if you're executing command-line interfaces that output JSON, CSV or other common formats, DuckDB enables you to do ad hoc queries on the results using SQL.
Execute this Bash
# Only using JSON as an example format, not required. cat some-file.json | duckdb -s "SELECT * FROM read_json_auto('/dev/stdin')";
Copy code
Expand
Share link
Filter column names using a patternSQL
Editor's note: DuckDB aims to make SQL even easier, while supporting standards whenever possible. When you have extremely wide tables, it's often helpful to return only columns matching a regex, and COLUMNS() does exactly that. With the EXCLUDE() and REPLACE() functions you get even more simplicity.
Execute this SQL
-- select only the column names that start with the dim_ SELECT COLUMNS('^dim_') FROM fact_table;
Copy code
Expand
Share link
Using the EXCLUDE() function in DuckDB
Tired of copying/pasting many column names to select all columns except a handful? The EXCLUDE() function in SQL allows you to exclude specific columns from the result set. Together with COLUMNS() and REPLACE(), DuckDB provides an easy way to specify the data you want returned in the result set.
Select all columns except a fewSQL
// This will select all information about ducks // except their height and weight SELECT * EXCLUDE (height, weight) FROM ducks;
Copy code
Select all columns matching a RegexSQL
// Only return the columns matching a regular expression // In this case, we're storing colors as individual columns // named color_1, color_2, etc. We want to return all // species of ducks along with their specified colors. SELECT species, COLUMNS('^color_\d+$') FROM ducks; // alternatively, we could store the colors in a // column named colors which is a list type, allowing us // to have multiple color values for each species.
Copy code
Expand
Share link
SUMMARIZESQL
Editor's note: the SUMMARIZE() function allows you to quickly understand your data. If you want to understand a little more about how it works under the hood, see Hamilton's other snippet on building your own SUMMARIZE() capabilities using built-in analytics functions.
Execute this SQL
SUMMARIZE some_tbl; SUMMARIZE from 'some_file.csv'; --- summary informations about a given table
Copy code
Expand
Share link
Select a sample of rows from a tableSQL
DuckDB allows sampling of data in your tables using a several different statistical techniques, usually to increase performance. The default sampling method is used in this case- this is a bernoulli variant where each vector has a specified chance of being included in the result set.
Execute this SQL
-- uses system sampling to select 10% of the people rows SELECT * FROM people USING SAMPLE 10%;
Copy code
Expand
Share link
Parse a File in an Unsupported FormatSQL
Editor's note: as data engineers, we're often burdened with data that's not in a standard format. Using DuckDB's basic string functions, advanced regex functions, list functions and the CSV parser, you can parse data of arbitrary formats.
Execute this SQL
/* This query pulls the contents of a file into a a single row/column, and then parses it in the select statement. As an example, we're parsing /proc/{pid}/environ files which are record separated by the null char (\0) and column separated by the '=' char. */ SELECT str_split(filename, '/')[3] AS PID, env_pair[1] AS Name, env_pair[2] AS Value FROM ( SELECT filename, str_split(unnest(str_split(column0, chr(0))), '=') AS env_pair FROM read_csv( '/proc/[0-9]*/environ', header=False, filename=true, -- Make sure your 'delim' is a string not in the file delim='\0', columns={column0: 'VARCHAR'}) ) WHERE Name != '' ORDER BY PID, Name
Copy code
Expand
Share link