Quickly convert a CSV to Parquet, bash function

Editor's note: DuckDB makes it easy to convert between a variety of popular data formats (CSV, JSON, Parquet, and more) using simple SQL statements. It's also easy to execute these statements from a bash shell so you have them ready to go.

Execute this Bash

#!/bin/bash
# Converts a CSV file to Parquet next to the original (e.g. data.csv -> data.parquet)
csv_to_parquet() {
    file_path="$1"
    duckdb -c "COPY (SELECT * FROM read_csv_auto('$file_path')) TO '${file_path%.*}.parquet' (FORMAT PARQUET);"
}
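A quick usage example (the file name is just an illustration):

# assumes data.csv exists in the current directory; writes data.parquet next to it
csv_to_parquet data.csv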


Mehdi Ouazza



Query the Output of Another Process

Editor's note: if you're running command-line tools that output JSON, CSV, or other common formats, DuckDB enables you to run ad hoc SQL queries on the results.

Execute this Bash

# JSON is just an example format here; any format DuckDB can read from stdin works.
cat some-file.json | duckdb -s "SELECT * FROM read_json_auto('/dev/stdin')"
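The same pattern works for any format DuckDB can read from a pipe; a minimal sketch with a placeholder command that emits CSV:

# any command that writes CSV to stdout can be queried the same way
some_command_emitting_csv | duckdb -s "SELECT count(*) FROM read_csv_auto('/dev/stdin')"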


Mark Roddy



Filter column names using a pattern

Editor's note: DuckDB aims to make SQL even easier, while supporting standards whenever possible. When you have extremely wide tables, it's often helpful to return only the columns matching a regex, and COLUMNS() does exactly that. Combined with the EXCLUDE and REPLACE clauses, you get even more simplicity.

Execute this SQL

-- select only the columns whose names start with dim_
SELECT COLUMNS('^dim_') FROM fact_table;
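COLUMNS() also composes with functions, applying the same expression to every matching column; a small sketch reusing the example table above:

-- take the max of every column whose name starts with dim_
SELECT max(COLUMNS('^dim_')) FROM fact_table;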


Octavian Zarzu



Select all columns except a few

Editor's note: tired of copying/pasting many column names to select all columns except a handful? The EXCLUDE clause allows you to exclude specific columns from the result set. Together with COLUMNS() and REPLACE, it provides an easy way to specify the data you want returned.

Execute this SQL

SELECT * EXCLUDE (column1, column2) FROM ducks;
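REPLACE works the same way, swapping an expression in for a column while keeping the rest of the star expansion; a sketch assuming column1 is a text column in the example table above:

-- keep every column, but return column1 uppercased in its place
SELECT * REPLACE (upper(column1) AS column1) FROM ducks;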


Ryan Boyd



SUMMARIZE

Editor's note: the SUMMARIZE command allows you to quickly understand your data. If you want to understand a little more about how it works under the hood, see Hamilton's other snippet on building your own SUMMARIZE capabilities using built-in analytics functions.

Execute this SQL

-- returns summary statistics for each column of a table or file
SUMMARIZE some_tbl;
SUMMARIZE FROM 'some_file.csv';
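SUMMARIZE also accepts an arbitrary query, which is handy for profiling a filtered subset; the filter column below is hypothetical:

-- profile only the rows that match a condition
SUMMARIZE SELECT * FROM some_tbl WHERE some_column IS NOT NULL;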


Carlo Piovesan



Select a sample of rows from a table

DuckDB allows sampling the data in your tables using several different statistical techniques, usually to increase performance. This example uses the default method for percentage sampling: system sampling, a vector-based variant of Bernoulli sampling in which each vector of rows has the specified chance of being included in the result set.

Execute this SQL

-- uses system sampling to select 10% of the people rows
SELECT * FROM people USING SAMPLE 10%;
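Other methods can be requested explicitly; a minimal sketch using the same example table:

-- true per-row Bernoulli sampling instead of the vector-based default
SELECT * FROM people USING SAMPLE 10% (bernoulli);
-- a fixed-size sample, which uses reservoir sampling
SELECT * FROM people USING SAMPLE 50 ROWS;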


Ryan Boyd



Parse a File in an Unsupported Format

Editor's note: as data engineers, we're often burdened with data that's not in a standard format. Using DuckDB's basic string functions, advanced regex functions, list functions and the CSV parser, you can parse data of arbitrary formats.

Execute this SQL

/*
  This query pulls the contents of a file into a single row/column,
  and then parses it in the select statement. As an example, we're
  parsing /proc/{pid}/environ files, whose records are separated by the
  null char (\0) and whose columns are separated by the '=' char.
*/
SELECT
  str_split(filename, '/')[3] AS PID,
  env_pair[1] AS Name,
  env_pair[2] AS Value
FROM (
  SELECT
    filename,
    str_split(unnest(str_split(column0, chr(0))), '=') AS env_pair
  FROM read_csv(
         '/proc/[0-9]*/environ',
         header=False,
         filename=true,
         -- Make sure your 'delim' is a string not in the file
         delim='\0',
         columns={column0: 'VARCHAR'})
)
WHERE Name != ''
ORDER BY PID, Name;
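The core trick, isolated: pick a delim that cannot occur in the file so each line lands in a single VARCHAR column, then split it yourself. The file and ':' separator below are just an illustration:

-- read /etc/passwd one line per row, then split each line into a list of fields
SELECT str_split(column0, ':') AS fields
FROM read_csv('/etc/passwd', header=False, delim='\0', columns={column0: 'VARCHAR'});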


Mark Roddy



Unnest JSON Array into Rows (pseudo-json_each)

Editor's note: as APIs and applications increasingly represent tabular data in JSON format, we sometimes want the comfort of seeing it as rows, queryable in SQL. DuckDB has a great extension for parsing JSON, including the ability to convert into a DuckDB list which can then be unnested into rows.

Execute this SQL

select 
    -- Split DuckDB list into multiple rows
    unnest(
        -- Convert JSON array into DuckDB list of JSON elements
        from_json(
            json('[{"col1":1, "col2":2}, {"col1":10, "col2":20}]'),
            '["JSON"]'
        )
    );
/* 
Returns:
{"col1":1,"col2":2}
{"col1":10,"col2":20}
*/
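To pull individual fields out of each element as regular columns, you can alias the unnested value and use the JSON extraction operators; a sketch reusing the column names from the example above:

select
    elem ->> 'col1' AS col1,
    elem ->> 'col2' AS col2
from (
    select unnest(
        from_json(
            json('[{"col1":1, "col2":2}, {"col1":10, "col2":20}]'),
            '["JSON"]'
        )
    ) AS elem
);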


Alex Monahan
