Boxplot quantiles all at once (SQL)
Editor's note: with a variety of supported statistical functions, including sampling and quantile computation, DuckDB provides analytical capabilities not always found in other databases. See the sampling sketch after the snippet.
select
    unnest([.1, .25, .5, .75, .9]) as quantile,
    unnest(quantile_cont(i, [.1, .25, .5, .75, .9])) as value
from range(101) t(i);
/* This returns:
| quantile | value |
| 0.10     | 10    |
| 0.25     | 25    |
| 0.50     | 50    |
| 0.75     | 75    |
| 0.90     | 90    |
*/
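The editor's note above also mentions sampling. A minimal sketch of DuckDB's USING SAMPLE syntax (the row counts and percentages here are illustrative, not part of the original snippet):

-- reservoir sample of exactly 10 rows
SELECT * FROM range(1000) t(i) USING SAMPLE 10;

-- bernoulli sample of roughly 1% of rows
SELECT * FROM range(1000) t(i) USING SAMPLE 1% (bernoulli);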
Convert epoch integer field to TIMESTAMP (SQL)
Editor's note: date/time conversion can be painful in SQL and in many programming languages. Simon shows how to handle it in DuckDB, complete with time zone values. The DuckDB docs list other timestamp-without-time-zone functions; see the sketch after the snippet.
select
    ('EPOCH'::TIMESTAMP + INTERVAL (epoc_field::INT) seconds)::TIMESTAMPTZ as epoc_tz
from (
    -- epoc column: number of seconds since 1 Jan 1970
    select 1682481422 as epoc_field
);
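If a plain TIMESTAMP (no time zone) is enough, DuckDB also offers direct epoch-conversion functions. A minimal sketch using the same epoch value (the choice of functions is mine; the DuckDB docs list more):

-- seconds since epoch to TIMESTAMP WITH TIME ZONE
select to_timestamp(1682481422);

-- milliseconds since epoch to plain TIMESTAMP
select epoch_ms(1682481422000);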
See the current version of DuckDB (SQL)
CALL pragma_version();
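If you only need the version string rather than the full pragma output, a related one-liner (assuming a reasonably recent DuckDB) is:

SELECT version();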
Combine multiple Parquet files (SQL)
COPY (SELECT * FROM read_parquet('/path/to/parquet/files/*.parquet'))
TO '/path/to/combined/file.parquet' (FORMAT PARQUET);
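If you also want to record which source file each row came from, read_parquet can emit a filename column. A sketch of the same copy with provenance added (the paths are the same placeholders as above):

COPY (SELECT * FROM read_parquet('/path/to/parquet/files/*.parquet', filename=true))
TO '/path/to/combined/file.parquet' (FORMAT PARQUET);

The filename=true flag adds a filename column holding each row's source path.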
Uncompress a zip archive and convert each CSV file therein to Parquet (R)
library(DBI)
library(duckdb)

# Unzip the archive to a temp dir
zip_file <- "path/to/your/zipfile.zip"
output_folder <- "path/to/output/folder/"
temp_dir <- tempfile()
unzip(zip_file, exdir = temp_dir)

# Get the list of CSV files in the temporary directory
csv_files <- list.files(temp_dir, pattern = "\\.csv$", full.names = TRUE)

# Loop through the CSV files and convert each to Parquet using DuckDB
con <- dbConnect(duckdb::duckdb())
for (csv_file in csv_files) {
  parquet_file <- file.path(
    output_folder,
    paste0(tools::file_path_sans_ext(basename(csv_file)), ".parquet")
  )
  # Let DuckDB read the CSV and write the Parquet file directly
  dbExecute(con, sprintf(
    "COPY (SELECT * FROM read_csv_auto('%s')) TO '%s' (FORMAT PARQUET)",
    csv_file, parquet_file
  ))
}
dbDisconnect(con, shutdown = TRUE)

# Clean up the temp dir
unlink(temp_dir, recursive = TRUE)
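If the extracted CSVs share a compatible schema, the per-file loop could instead collapse into one DuckDB statement that writes a single combined Parquet file. A sketch under that assumption (paths are placeholders; union_by_name aligns columns by name across files):

COPY (
    SELECT * FROM read_csv_auto('/path/to/temp/dir/*.csv', union_by_name=true, filename=true)
) TO '/path/to/output/combined.parquet' (FORMAT PARQUET);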
Convert CSV to Parquet and amend the schema (SQL)
-- This selects only specific fields from the CSV
-- It also renames one of the fields in the target schema
COPY (
    SELECT col1, col2, col3 AS new_col_name
    FROM read_csv('~/data/source.csv', AUTO_DETECT=TRUE)
) TO '~/data/target.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');
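The same pattern can amend column types as well as names. A hedged variant of the snippet above (the INTEGER and DATE targets are assumptions about the source data):

COPY (
    SELECT col1::INTEGER AS col1, col2, TRY_CAST(col3 AS DATE) AS new_col_name
    FROM read_csv('~/data/source.csv', AUTO_DETECT=TRUE)
) TO '~/data/target.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');

TRY_CAST yields NULL instead of an error when a value does not parse.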
Filter column names using a pattern (SQL)
Editor's note: DuckDB aims to make SQL even easier, while supporting standards whenever possible. When you have extremely wide tables, it's often helpful to return only the columns matching a regex, and COLUMNS() does exactly that. The EXCLUDE and REPLACE clauses give you even more flexibility; see the sketch after the snippet.
-- select only the columns whose names start with dim_
SELECT COLUMNS('^dim_') FROM fact_table;
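A minimal sketch of the EXCLUDE and REPLACE clauses mentioned in the note (internal_id and amount are hypothetical columns):

-- drop one column and transform another without listing the rest
SELECT * EXCLUDE (internal_id) REPLACE (round(amount, 2) AS amount) FROM fact_table;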
Query Parquet files and their metadata (SQL)
-- listing files
FROM glob('dataset/*');

-- reading from files
FROM 'dataset/*.parquet' LIMIT 100;

-- reading parquet file metadata (min_value, max_value, null_count for each field in each file)
FROM parquet_metadata('dataset/*.parquet');

-- convert files or export tables to parquet
COPY (FROM tbl) TO 'file.parquet' (FORMAT 'parquet');
COPY 'data.csv' TO 'data.parquet' (FORMAT 'parquet');
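Beyond per-column statistics, parquet_schema exposes the schema stored in each file, which helps spot schema drift across a dataset. A small sketch against the same glob:

-- column names, types, and encodings per file
FROM parquet_schema('dataset/*.parquet');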