List Hive PartitionsSQL
Given Hive partitioned data, list all the keys, values and files. Handy to get a summary of the structure of Hive data.
Execute this SQL
WITH hive_cols AS ( PIVOT ( SELECT file.split('/').list_filter(lambda x: x.contains('=')).list_transform(lambda x: x.split('=')).unnest() AS kv, file FROM glob('/path/to/hive/**/*.parquet') ) ON kv[1] USING first(kv[2]) ) FROM hive_cols;
Copy code
Expand
Share link
Top 50 Most Queried Domains from DNS A Records in PCAP AnalysisSQL
This query parses a PCAP file to identify the top 50 most requested domains from DNS A record queries. It filters DNS packets, extracts domains from the _ws.col.info field, normalizes them to lowercase, counts occurrences, and sorts the results by frequency in descending order.
Execute this SQL
SELECT LOWER(REGEXP_EXTRACT("_ws.col.info", ' A ([^ ]+)', 1)) AS domain, COUNT(*) AS total FROM read_pcap('dump.pcap') WHERE "frame.protocols" LIKE '%dns%' AND "_ws.col.info" LIKE '% A %' GROUP BY domain ORDER BY total DESC LIMIT 50;
Copy code
Expand
Share link
Use method chaining to filter tables
This is just a simple example of how to implement method chaining using `list` and `unnest` without any other extensions to provide a functional interface.
ExampleSQL
SELECT list(t) .list_filter(lambda c: c.registration_date >= '2023-02-01') .list_transform(lambda c: struct_insert(c, is_north_american:=c.country in ('USA', 'Mexico', 'Canada'))) .list_filter(lambda c: c.is_north_american) .unnest(recursive:=true) FROM 'customers.csv' AS t;
Copy code
customers.csvBash
customer_id,name,email,age,tier,country,registration_date,lifetime_value 1,Alice Johnson,alice.johnson@email.com,29,Premium,USA,2023-01-15,2847.50 2,Bob Smith,bob.smith@email.com,34,Standard,Canada,2023-02-03,1256.80 3,Charlie Brown,charlie.brown@email.com,42,Premium,UK,2023-01-22,3421.90 4,Diana Ross,diana.ross@email.com,38,Gold,USA,2023-03-10,5678.25 5,Eve Wilson,eve.wilson@email.com,26,Standard,Australia,2023-02-18,845.60 6,Frank Miller,frank.miller@email.com,31,Premium,Germany,2023-01-08,2156.75 7,Grace Lee,grace.lee@email.com,28,Standard,Canada,2023-03-05,967.40 8,Henry Davis,henry.davis@email.com,45,Gold,USA,2023-01-29,4532.80 9,Ivy Chen,ivy.chen@email.com,33,Premium,Singapore,2023-02-14,2890.45 10,Jack Wilson,jack.wilson@email.com,39,Standard,UK,2023-03-12,1134.20 11,Kate Brown,kate.brown@email.com,27,Premium,Australia,2023-01-18,2234.85 12,Leo Garcia,leo.garcia@email.com,36,Gold,Spain,2023-02-27,3789.60 13,Mia Taylor,mia.taylor@email.com,30,Standard,USA,2023-03-08,1089.95 14,Noah Kim,noah.kim@email.com,41,Premium,South Korea,2023-01-25,2567.30 15,Olivia Jones,olivia.jones@email.com,35,Gold,Canada,2023-02-11,4123.75 16,Paul Anderson,paul.anderson@email.com,32,Standard,UK,2023-03-15,876.40 17,Quinn Lee,quinn.lee@email.com,29,Premium,Singapore,2023-01-12,2678.95 18,Rachel Green,rachel.green@email.com,37,Gold,USA,2023-02-06,3956.80 19,Sam Wilson,sam.wilson@email.com,28,Standard,Australia,2023-03-01,1245.30 20,Tina Rodriguez,tina.rodriguez@email.com,44,Premium,Mexico,2023-01-31,2345.65
Copy code
Expand
Share link
Top values of all columnsSQL
Get the top values by count for all columns, useful for searching and analysis
Execute this SQL
select approx_top_k(columns(*), 3) from tbl;
Copy code
Expand
Share link
Remove NULL columnsSQL
Clean a table removing columns with only NULL values
Execute this SQL
CREATE OR REPLACE MACRO notnullcols(tablename) AS ( WITH t1 AS ( UNPIVOT( FROM query_table(tablename) SELECT max( columns(*)) ) ON columns(*)::varchar ) FROM t1 SELECT list(name) ) ; SET VARIABLE notnullcols = notnullcols(t); FROM t SELECT columns(c -> list_contains(getvariable('notnullcols'), c)) ;
Copy code
Expand
Share link
Delete NULL lines SQL
Clean a table removing lines where all cols values are NULL
Execute this SQL
WITH t AS ( VALUES (NULL, NULL), ('john', 'smith'), ('mike', NULL), (NULL, 'jones'), (NULL, NULL) ) FROM t EXCEPT FROM t WHERE COLUMNS(*) IS NULL ;
Copy code
Expand
Share link
Macro to glimpse the table in the current DBSQL
An attempt to replicate the functionality of dplyrs (and polars) glimpse, with a simple function which takes the name of a table in the current database as a quoted string. Shows column name, data types and the first few values. Intended for use in the CLI. usage: FROM glimpse('table_name');
Execute this SQL
CREATE OR REPLACE MACRO glimpse(table_name) AS TABLE WITH TableSchema AS ( SELECT cid, name AS column_name, type AS column_type FROM pragma_table_info(table_name) ), SampleData AS ( -- Select the first 5 rows from the target table SELECT * FROM query_table(table_name) LIMIT 5 ), SampleDataUnpivoted AS ( -- Unpivot the sample data: columns become rows UNPIVOT (SELECT list(COLUMNS(*)::VARCHAR) FROM SampleData) ON COLUMNS(*) INTO NAME column_name VALUE sample_values_list -- This will be a list of strings ) -- Final selection joining schema info with sample data SELECT ts.column_name, ts.column_type, -- Convert the list to string and remove brackets for cleaner display regexp_replace(usp.sample_values_list::VARCHAR, '^\[|\]$', '', 'g') AS sample_data FROM TableSchema ts JOIN SampleDataUnpivoted usp ON ts.column_name = usp.column_name ORDER BY ts.cid;
Copy code
Expand
Share link