Quickly convert a CSV to Parquet, bash function
Editor's note: DuckDB makes it easy to convert between a variety of popular data formats (CSV, JSON, Parquet, and more) using simple SQL statements. It's also easy to execute these statements from a bash shell so you have them ready to go.
Execute this Bash
#!/bin/bash function csv_to_parquet() { file_path="$1" duckdb -c "COPY (SELECT * FROM read_csv_auto('$file_path')) TO '${file_path%.*}.parquet' (FORMAT PARQUET);" }
Copy code
Expand
Share link
Query the Output of Another Process
Editor's note: if you're executing command-line interfaces that output JSON, CSV or other common formats, DuckDB enables you to do ad hoc queries on the results using SQL.
Execute this Bash
# Only using JSON as an example format, not required. cat some-file.json | duckdb -s "SELECT * FROM read_json_auto('/dev/stdin')";
Copy code
Expand
Share link
Filter column names using a pattern
Editor's note: DuckDB aims to make SQL even easier, while supporting standards whenever possible. When you have extremely wide tables, it's often helpful to return only columns matching a regex, and COLUMNS() does exactly that. With the EXCLUDE() and REPLACE() functions you get even more simplicity.
Execute this SQL
-- select only the column names that start with the dim_ SELECT COLUMNS('^dim_') FROM fact_table;
Copy code
Expand
Share link
Select all columns except a few
Editor's note: tired of copying/pasting many column names to select all columns except a handful? The EXCLUDE() function allows you to exclude specific columns from the result set. Together with COLUMNS() and REPLACE() it provides an easy way to specific the data you want returned.
Execute this SQL
SELECT * EXCLUDE (column1, colum2) FROM ducks;
Copy code
Expand
Share link
SUMMARIZE
Editor's note: the SUMMARIZE() function allows you to quickly understand your data. If you want to understand a little more about how it works under the hood, see Hamilton's other snippet on building your own SUMMARIZE() capabilities using built-in analytics functions.
Execute this SQL
SUMMARIZE some_tbl; SUMMARIZE from 'some_file.csv'; --- summary informations about a given table
Copy code
Expand
Share link
Select a sample of rows from a table
DuckDB allows sampling of data in your tables using a several different statistical techniques, usually to increase performance. The default sampling method is used in this case- this is a bernoulli variant where each vector has a specified chance of being included in the result set.
Execute this SQL
-- uses system sampling to select 10% of the people rows SELECT * FROM people USING SAMPLE 10%;
Copy code
Expand
Share link
Parse a File in an Unsupported Format
Editor's note: as data engineers, we're often burdened with data that's not in a standard format. Using DuckDB's basic string functions, advanced regex functions, list functions and the CSV parser, you can parse data of arbitrary formats.
Execute this SQL
/* This query pulls the contents of a file into a a single row/column, and then parses it in the select statement. As an example, we're parsing /proc/{pid}/environ files which are record separated by the null char (\0) and column separated by the '=' char. */ SELECT str_split(filename, '/')[3] AS PID, env_pair[1] AS Name, env_pair[2] AS Value FROM ( SELECT filename, str_split(unnest(str_split(column0, chr(0))), '=') AS env_pair FROM read_csv( '/proc/[0-9]*/environ', header=False, filename=true, -- Make sure your 'delim' is a string not in the file delim='\0', columns={column0: 'VARCHAR'}) ) WHERE Name != '' ORDER BY PID, Name
Copy code
Expand
Share link
Unnest JSON Array into Rows (pseudo-json_each)
Editor's note: as APIs and applications increasingly represent tabular data in JSON format, we sometimes want the comfort of seeing it as rows, queryable in SQL. DuckDB has a great extension for parsing JSON, including the ability to convert into a DuckDB list which can then be unnested into rows.
Execute this SQL
select -- Split DuckDB list into multiple rows unnest( -- Convert JSON array into DuckDB list of JSON elements from_json( json('[{"col1":1, "col2":2}, {"col1":10, "col2":20}]'), '["JSON"]' ) ); /* Returns: {"col1":1,"col2":2} {"col1":10,"col2":20} */
Copy code
Expand
Share link