Boxplot quantiles all at once (SQL)

Editor's note: with a variety of supported statistical functions, including sampling and quantiles, DuckDB provides analytical capabilities that are not always present in other databases.

Execute this SQL

select 
    unnest([.1,.25,.5,.75,.9]) as quantile,
    unnest(quantile_cont(i, [.1,.25,.5,.75,.9])) as value 
from range(101) t(i);

/* This returns:
| quantile | value |
|     0.10 |    10 |
|     0.25 |    25 |
|     0.50 |    50 |
|     0.75 |    75 |
|     0.90 |    90 |
*/
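The editor's note also mentions sampling. As a minimal sketch, DuckDB's USING SAMPLE clause can be combined with quantile_cont (the table measurements and its value column here are hypothetical):

-- compute quartiles over a 10% Bernoulli sample of the table
SELECT quantile_cont(value, [0.25, 0.5, 0.75]) AS quartiles
FROM measurements USING SAMPLE 10 PERCENT (bernoulli);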

Alex Monahan


Convert epoch integer field to timestamp (SQL)

Editor's note: date/time conversion can be painful in SQL and in many programming languages. Simon shows how to handle it in DuckDB, including time zone values. The DuckDB docs also cover functions for timestamps without time zones.

Execute this SQL

select ('EPOCH'::TIMESTAMP + INTERVAL (epoc_field::INT) seconds)::TIMESTAMPTZ  as epoc_tz
from
(
    -- epoc column - number of seconds since 1 Jan 1970
    select 1682481422 as epoc_field
)
;
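A shorter alternative is DuckDB's built-in to_timestamp() function, which interprets its argument as seconds since the epoch and returns a TIMESTAMP WITH TIME ZONE; a minimal sketch using the same value:

SELECT to_timestamp(1682481422) AS epoc_tz;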

Simon Aubury


See the current version of DuckDB (SQL)

Execute this SQL

CALL pragma_version();
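If you only need the version string, the version() scalar function is a minimal alternative (a sketch):

SELECT version();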

Simon Aubury


Combine multiple Parquet files (SQL)

Execute this SQL

COPY (SELECT * FROM read_parquet('/path/to/parquet/files/*.parquet')) TO '/path/to/combined/file.parquet' (FORMAT PARQUET);
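If the input files do not all share an identical column order, read_parquet can align them by column name instead; a sketch using the same placeholder paths:

COPY (SELECT * FROM read_parquet('/path/to/parquet/files/*.parquet', union_by_name = true))
  TO '/path/to/combined/file.parquet' (FORMAT PARQUET);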

Mike Metzger


Uncompress a zip archive and convert each CSV file therein to Parquet (R)

Execute this R

library(DBI)
library(duckdb)
# Unzip the archive to a temp dir
zip_file <- "path/to/your/zipfile.zip"
output_folder <- "path/to/output/folder/"
temp_dir <- tempfile()
unzip(zip_file, exdir = temp_dir)
# Get the list of CSV files in the temporary directory
csv_files <- list.files(temp_dir, pattern = "\\.csv$", full.names = TRUE)
# Loop through the CSV files and convert each to a Parquet file using DuckDB
for (csv_file in csv_files) {
    con <- dbConnect(duckdb::duckdb())
    # Load the CSV into a DuckDB table
    duckdb::duckdb_read_csv(con, "input_table", csv_file)
    parquet_file <- file.path(output_folder,
        paste0(tools::file_path_sans_ext(basename(csv_file)), ".parquet"))
    # Write the table back out as a Parquet file
    dbExecute(con, sprintf("COPY input_table TO '%s' (FORMAT PARQUET)", parquet_file))
    dbDisconnect(con, shutdown = TRUE)
}
# Remove the temporary directory
unlink(temp_dir, recursive = TRUE)
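If the goal is a single combined Parquet file rather than one file per CSV, and the CSVs share a schema, the whole loop collapses into one DuckDB statement; a sketch in SQL (paths are placeholders):

COPY (SELECT * FROM read_csv_auto('path/to/csv/folder/*.csv'))
  TO 'path/to/combined.parquet' (FORMAT PARQUET);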

Kyle Lundstedt


Convert CSV to Parquet and amend the schema (SQL)

Execute this SQL

-- This selects only specific fields from the CSV
-- It also renames one of the fields in the target schema
COPY (SELECT col1, col2, col3 AS new_col_name
	    FROM read_csv('~/data/source.csv',AUTO_DETECT=TRUE))
  TO '~/data/target.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');

Robin Moffatt


Filter column names using a pattern (SQL)

Editor's note: DuckDB aims to make SQL even easier, while supporting standards whenever possible. When you have extremely wide tables, it's often helpful to return only the columns matching a regex, and COLUMNS() does exactly that. The EXCLUDE and REPLACE modifiers simplify things even further.

Execute this SQL

-- select only the columns whose names start with dim_
SELECT COLUMNS('^dim_') FROM fact_table;
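The EXCLUDE and REPLACE modifiers mentioned in the editor's note apply to the star expression; a minimal sketch (the column names internal_id and revenue are hypothetical):

-- return every column except internal_id
SELECT * EXCLUDE (internal_id) FROM fact_table;

-- return every column, rewriting revenue in place
SELECT * REPLACE (round(revenue, 2) AS revenue) FROM fact_table;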

Octavian Zarzu


Query Parquet files and their metadata (SQL)

Execute this SQL

-- listing files
FROM glob('dataset/*');

-- reading from files
FROM 'dataset/*.parquet' LIMIT 100; 

-- reading parquet files metadata (min_value, max_value, null_count for each field in each file)
FROM parquet_metadata('dataset/*.parquet');

-- convert files or export tables to parquet
COPY (FROM tbl) TO 'file.parquet' (FORMAT 'parquet');
COPY (FROM 'data.csv') TO 'data.parquet' (FORMAT 'parquet');
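To inspect the declared column names and types (rather than the per-row-group statistics), parquet_schema() works on the same glob; a minimal sketch:

-- reading parquet file schemas (column names and types for each file)
FROM parquet_schema('dataset/*.parquet');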

Octavian Zarzu