List Hive PartitionsSQL

Given Hive partitioned data, list all the keys, values and files. Handy to get a summary of the structure of Hive data.

Execute this SQL

WITH hive_cols AS (
  PIVOT (
  SELECT
    file.split('/').list_filter(lambda x: x.contains('=')).list_transform(lambda x: x.split('=')).unnest() AS kv,
    file
  FROM glob('/path/to/hive/**/*.parquet')
  )
  ON kv[1]
  USING first(kv[2])
)
FROM hive_cols;

Copy code

Alejandro Wainzinger

Expand

Share link


Top 50 Most Queried Domains from DNS A Records in PCAP AnalysisSQL

This query parses a PCAP file to identify the top 50 most requested domains from DNS A record queries. It filters DNS packets, extracts domains from the _ws.col.info field, normalizes them to lowercase, counts occurrences, and sorts the results by frequency in descending order.

Execute this SQL

SELECT LOWER(REGEXP_EXTRACT("_ws.col.info", ' A ([^ ]+)', 1)) AS domain, COUNT(*) AS total
 FROM read_pcap('dump.pcap')
 WHERE "frame.protocols" LIKE '%dns%' 
          AND "_ws.col.info" LIKE '% A %' 
GROUP BY domain 
ORDER BY total DESC 
LIMIT 50;

Copy code

A
arrowsix

Expand

Share link


Use method chaining to filter tables

This is just a simple example of how to implement method chaining using `list` and `unnest` without any other extensions to provide a functional interface.

ExampleSQL

SELECT 
    list(t)
    .list_filter(lambda c: c.registration_date >= '2023-02-01')
    .list_transform(lambda c: struct_insert(c, 
        is_north_american:=c.country in ('USA', 'Mexico', 'Canada')))
    .list_filter(lambda c: c.is_north_american)
    .unnest(recursive:=true)
FROM 'customers.csv' AS t;

Copy code

customers.csvBash

customer_id,name,email,age,tier,country,registration_date,lifetime_value
1,Alice Johnson,alice.johnson@email.com,29,Premium,USA,2023-01-15,2847.50
2,Bob Smith,bob.smith@email.com,34,Standard,Canada,2023-02-03,1256.80
3,Charlie Brown,charlie.brown@email.com,42,Premium,UK,2023-01-22,3421.90
4,Diana Ross,diana.ross@email.com,38,Gold,USA,2023-03-10,5678.25
5,Eve Wilson,eve.wilson@email.com,26,Standard,Australia,2023-02-18,845.60
6,Frank Miller,frank.miller@email.com,31,Premium,Germany,2023-01-08,2156.75
7,Grace Lee,grace.lee@email.com,28,Standard,Canada,2023-03-05,967.40
8,Henry Davis,henry.davis@email.com,45,Gold,USA,2023-01-29,4532.80
9,Ivy Chen,ivy.chen@email.com,33,Premium,Singapore,2023-02-14,2890.45
10,Jack Wilson,jack.wilson@email.com,39,Standard,UK,2023-03-12,1134.20
11,Kate Brown,kate.brown@email.com,27,Premium,Australia,2023-01-18,2234.85
12,Leo Garcia,leo.garcia@email.com,36,Gold,Spain,2023-02-27,3789.60
13,Mia Taylor,mia.taylor@email.com,30,Standard,USA,2023-03-08,1089.95
14,Noah Kim,noah.kim@email.com,41,Premium,South Korea,2023-01-25,2567.30
15,Olivia Jones,olivia.jones@email.com,35,Gold,Canada,2023-02-11,4123.75
16,Paul Anderson,paul.anderson@email.com,32,Standard,UK,2023-03-15,876.40
17,Quinn Lee,quinn.lee@email.com,29,Premium,Singapore,2023-01-12,2678.95
18,Rachel Green,rachel.green@email.com,37,Gold,USA,2023-02-06,3956.80
19,Sam Wilson,sam.wilson@email.com,28,Standard,Australia,2023-03-01,1245.30
20,Tina Rodriguez,tina.rodriguez@email.com,44,Premium,Mexico,2023-01-31,2345.65

Copy code

Teague Sterling

Expand

Share link


Top values of all columnsSQL

Get the top values by count for all columns, useful for searching and analysis

Execute this SQL

select approx_top_k(columns(*), 3) from tbl;

Copy code

asrar

Expand

Share link


Remove NULL columnsSQL

Clean a table removing columns with only NULL values

Execute this SQL

CREATE OR REPLACE MACRO notnullcols(tablename) AS (
	WITH t1 AS (
			UNPIVOT( FROM query_table(tablename) SELECT max( columns(*)) ) 
			ON columns(*)::varchar
	) FROM t1 SELECT list(name) 
) ;

SET VARIABLE notnullcols = notnullcols(t);

FROM t SELECT columns(c -> list_contains(getvariable('notnullcols'), c)) ;

Copy code

MAUVIERE

Expand

Share link


Delete NULL lines SQL

Clean a table removing lines where all cols values are NULL

Execute this SQL

WITH t AS (
VALUES (NULL, NULL),
  ('john', 'smith'),
  ('mike', NULL),
  (NULL, 'jones'),
  (NULL, NULL)
) FROM t
EXCEPT
FROM t
WHERE COLUMNS(*) IS NULL ;

Copy code

MAUVIERE

Expand

Share link


Macro to glimpse the table in the current DBSQL

An attempt to replicate the functionality of dplyrs (and polars) glimpse, with a simple function which takes the name of a table in the current database as a quoted string. Shows column name, data types and the first few values. Intended for use in the CLI. usage: FROM glimpse('table_name');

Execute this SQL

CREATE OR REPLACE MACRO glimpse(table_name) AS TABLE
WITH TableSchema AS (
    
    SELECT
        cid,  
        name AS column_name,
        type AS column_type
    FROM pragma_table_info(table_name)
),
SampleData AS (
    -- Select the first 5 rows from the target table
    SELECT *
    FROM query_table(table_name)
    LIMIT 5
),
SampleDataUnpivoted AS (
    -- Unpivot the sample data: columns become rows
    UNPIVOT (SELECT list(COLUMNS(*)::VARCHAR) FROM SampleData)
    ON COLUMNS(*)
    INTO
        NAME column_name
        VALUE sample_values_list -- This will be a list of strings
)
-- Final selection joining schema info with sample data
SELECT
    ts.column_name,
    ts.column_type,
    -- Convert the list to string and remove brackets for cleaner display
    regexp_replace(usp.sample_values_list::VARCHAR, '^\[|\]$', '', 'g') AS sample_data
FROM TableSchema ts
JOIN SampleDataUnpivoted usp ON ts.column_name = usp.column_name
ORDER BY ts.cid; 

Copy code

Steve Crawshaw

Expand

Share link