Query Parquet data in S3 (Bash)
Editor's note: DuckDB users often work with files in Parquet format, which has become a standard for representing data in data lakes. While DuckDB lets you work with local Parquet files, it can also query files stored in blob storage such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
Execute this Bash
# Assuming you have the following environment variables defined:
# AWS_ACCESS_KEY_ID
# AWS_SECRET_ACCESS_KEY
# AWS_DEFAULT_REGION
# (run INSTALL httpfs; once beforehand if the extension is not yet installed)
duckdb -c 'LOAD httpfs; SELECT count(*) FROM read_parquet("s3://<bucket>/<prefix>/*.parquet");'
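If you prefer to configure credentials inside DuckDB rather than via environment variables, a minimal sketch run in the duckdb shell looks like the following; the region and credential values are placeholders to replace with your own:

-- one-time install, then load the extension
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';            -- assumed region; use your bucket's region
SET s3_access_key_id = '<key id>';      -- placeholder credential
SET s3_secret_access_key = '<secret>';  -- placeholder credential
SELECT count(*) FROM read_parquet('s3://<bucket>/<prefix>/*.parquet');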
SUMMARIZE (SQL)
Editor's note: the SUMMARIZE command allows you to quickly understand your data. If you want to understand a little more about how it works under the hood, see Hamilton's other snippet on building your own SUMMARIZE capabilities using built-in analytics functions.
Execute this SQL
-- summary information about a given table or file
SUMMARIZE some_tbl;
SUMMARIZE FROM 'some_file.csv';
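As a taste of what SUMMARIZE automates, a minimal hand-rolled sketch of the same idea for a single column might look like this; some_tbl and col are stand-in names for your own table and column:

-- a few of the statistics SUMMARIZE computes per column
SELECT
    count(*)            AS count,
    min(col)            AS min,
    max(col)            AS max,
    avg(col)            AS mean,
    count(DISTINCT col) AS unique_values
FROM some_tbl;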
Compute a metric for each numeric column and return the values in a long table (requires 0.7.2+) (SQL)
Editor's note: if your data originates as mixed types or in a format like CSV, you might want to cast every column to a numeric type without risking an error for oddly-typed values. You can do so with TRY_CAST(), which attempts a CAST but returns NULL where the conversion is not possible.
Execute this SQL
with computed as (
    select sum(try_cast(columns(*) as double))
    from read_csv_auto('aapl.csv')
)
select
    -- restore original column names
    trim(list_element(regexp_extract_all(name, '\.(.*?) AS', 1), 1), '"') as name,
    value
from (pivot_longer computed on *)
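To see the TRY_CAST() behavior the note describes in isolation, a quick sketch (the literal values here are made up for illustration):

SELECT
    try_cast('3.14' AS DOUBLE) AS parses_fine,   -- 3.14
    try_cast('n/a'  AS DOUBLE) AS returns_null;  -- NULL instead of a cast error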
Convert CSV to Parquet and provide schema to use (Bash)
Editor's note: while there are other snippets showing file conversion, Parth's snippet shows you how to convert from CSV to Parquet files using DuckDB, with specification of the entire schema (columns) and the compression codec.
Execute this Bash
duckdb -c "COPY (SELECT * FROM read_csv('pageviews-sanitized-20230101-000000.csv', delim=' ', header=False, columns={'domain_code': 'VARCHAR', 'page_title': 'VARCHAR', 'count_views': 'UINTEGER', 'total_response_size': 'UINTEGER'})) TO 'pageviews-sanitized-20230101-000000.parquet' (FORMAT 'PARQUET', CODEC 'zstd')"
Combine several Parquet files into one and compress with zstd (Bash)
Editor's note: another great example of using DuckDB's broad data format support to merge multiple Parquet files into one. Parth also kindly shows you how to compress the resulting Parquet file with the zstd codec. DuckDB also supports the gzip and snappy compression codecs.
Execute this Bash
duckdb -c "COPY (SELECT * FROM '~/Binance_Spot_Data/*.parquet') TO 'binance.parquet' (FORMAT 'PARQUET', CODEC 'zstd')"
Boxplot quantiles all at once (SQL)
Editor's note: with a variety of supported statistical functions, including things like sampling and producing quantiles, DuckDB provides great analytical capabilities not always present in other databases.
Execute this SQL
select
    unnest([.1, .25, .5, .75, .9]) as quantile,
    unnest(quantile_cont(i, [.1, .25, .5, .75, .9])) as value
from range(101) t(i);

/* This returns:
| quantile | value |
|----------|-------|
| 0.10     | 10    |
| 0.25     | 25    |
| 0.50     | 50    |
| 0.75     | 75    |
| 0.90     | 90    |
*/
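The note also mentions sampling; as a quick sketch of that capability over the same range (DuckDB's USING SAMPLE clause, which for a fixed row count defaults to reservoir sampling):

-- draw a random sample of 10 rows
SELECT i FROM range(101) t(i) USING SAMPLE 10 ROWS;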
Convert epoch integer field to Timestamp (SQL)
Editor's note: date/time conversion can be painful in SQL and in many programming languages. Simon shows you how to handle it in DuckDB here, with timezone-aware values. There are also timestamp-without-timezone functions in the DuckDB docs.
Execute this SQL
select ('EPOCH'::TIMESTAMP + INTERVAL (epoc_field::INT) seconds)::TIMESTAMPTZ as epoc_tz
from (
    -- epoch column: number of seconds since 1 Jan 1970
    select 1682481422 as epoc_field
);
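DuckDB also ships a built-in shortcut for this particular conversion; a minimal sketch using the same example value:

-- to_timestamp() interprets a number as seconds since the Unix epoch
-- and returns a TIMESTAMP WITH TIME ZONE
select to_timestamp(1682481422) as epoc_tz;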
Show the DuckDB version (SQL)
Execute this SQL
CALL pragma_version();
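pragma_version() reports the library version and source id; the version string alone is also available via a scalar function:

-- equivalent one-liner returning just the version string
SELECT version();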