Fetch HackerNews Stories with read_json()

You can read JSON from any endpoint into DuckDB. This uses read_json() to fetch all HackerNews 'Who is hiring' stories and create a table with all stories and the comment IDs. It works by: - creating a list of URLs - reading the list and creating the table - querying the table

Create JSON URLsSQL

-- https://hn.algolia.com/api
with urls as (
  select
    'https://hn.algolia.com/api/v1/search_by_date?tags=author_whoishiring&page=' || page_num || '&query=hiring?&attributesToRetrieve=title,tags,comment_text,created_at,object_id,story_id,story_title,children,story_text' as url
  from unnest(range(0,10)) as t(page_num)
)

select array_to_json(list(url)) from urls

Copy code

Read JSON from API endpointSQL

create or replace table hn_stories as (
select 
  hit.title as story_title,
  hit.story_text,
  hit.created_at,
  hit.objectId,
  hit.children,

from read_json(
 -- your_url_list_from_above --
  ), unnest(hits) as t(hit)
)

Copy code

Dumky de Wilde

Expand

Share link


List Hive PartitionsSQL

Given Hive partitioned data, list all the keys, values and files. Handy to get a summary of the structure of Hive data.

Execute this SQL

WITH hive_cols AS (
  PIVOT (
  SELECT
    file.split('/').list_filter(lambda x: x.contains('=')).list_transform(lambda x: x.split('=')).unnest() AS kv,
    file
  FROM glob('/path/to/hive/**/*.parquet')
  )
  ON kv[1]
  USING first(kv[2])
)
FROM hive_cols;

Copy code

Alejandro Wainzinger

Expand

Share link


Top 50 Most Queried Domains from DNS A Records in PCAP AnalysisSQL

This query parses a PCAP file to identify the top 50 most requested domains from DNS A record queries. It filters DNS packets, extracts domains from the _ws.col.info field, normalizes them to lowercase, counts occurrences, and sorts the results by frequency in descending order.

Execute this SQL

SELECT LOWER(REGEXP_EXTRACT("_ws.col.info", ' A ([^ ]+)', 1)) AS domain, COUNT(*) AS total
 FROM read_pcap('dump.pcap')
 WHERE "frame.protocols" LIKE '%dns%' 
          AND "_ws.col.info" LIKE '% A %' 
GROUP BY domain 
ORDER BY total DESC 
LIMIT 50;

Copy code

A
arrowsix

Expand

Share link


Use method chaining to filter tables

This is just a simple example of how to implement method chaining using `list` and `unnest` without any other extensions to provide a functional interface.

ExampleSQL

SELECT 
    list(t)
    .list_filter(lambda c: c.registration_date >= '2023-02-01')
    .list_transform(lambda c: struct_insert(c, 
        is_north_american:=c.country in ('USA', 'Mexico', 'Canada')))
    .list_filter(lambda c: c.is_north_american)
    .unnest(recursive:=true)
FROM 'customers.csv' AS t;

Copy code

customers.csvBash

customer_id,name,email,age,tier,country,registration_date,lifetime_value
1,Alice Johnson,alice.johnson@email.com,29,Premium,USA,2023-01-15,2847.50
2,Bob Smith,bob.smith@email.com,34,Standard,Canada,2023-02-03,1256.80
3,Charlie Brown,charlie.brown@email.com,42,Premium,UK,2023-01-22,3421.90
4,Diana Ross,diana.ross@email.com,38,Gold,USA,2023-03-10,5678.25
5,Eve Wilson,eve.wilson@email.com,26,Standard,Australia,2023-02-18,845.60
6,Frank Miller,frank.miller@email.com,31,Premium,Germany,2023-01-08,2156.75
7,Grace Lee,grace.lee@email.com,28,Standard,Canada,2023-03-05,967.40
8,Henry Davis,henry.davis@email.com,45,Gold,USA,2023-01-29,4532.80
9,Ivy Chen,ivy.chen@email.com,33,Premium,Singapore,2023-02-14,2890.45
10,Jack Wilson,jack.wilson@email.com,39,Standard,UK,2023-03-12,1134.20
11,Kate Brown,kate.brown@email.com,27,Premium,Australia,2023-01-18,2234.85
12,Leo Garcia,leo.garcia@email.com,36,Gold,Spain,2023-02-27,3789.60
13,Mia Taylor,mia.taylor@email.com,30,Standard,USA,2023-03-08,1089.95
14,Noah Kim,noah.kim@email.com,41,Premium,South Korea,2023-01-25,2567.30
15,Olivia Jones,olivia.jones@email.com,35,Gold,Canada,2023-02-11,4123.75
16,Paul Anderson,paul.anderson@email.com,32,Standard,UK,2023-03-15,876.40
17,Quinn Lee,quinn.lee@email.com,29,Premium,Singapore,2023-01-12,2678.95
18,Rachel Green,rachel.green@email.com,37,Gold,USA,2023-02-06,3956.80
19,Sam Wilson,sam.wilson@email.com,28,Standard,Australia,2023-03-01,1245.30
20,Tina Rodriguez,tina.rodriguez@email.com,44,Premium,Mexico,2023-01-31,2345.65

Copy code

Teague Sterling

Expand

Share link


Top values of all columnsSQL

Get the top values by count for all columns, useful for searching and analysis

Execute this SQL

select approx_top_k(columns(*), 3) from tbl;

Copy code

asrar

Expand

Share link


Remove NULL columnsSQL

Clean a table removing columns with only NULL values

Execute this SQL

CREATE OR REPLACE MACRO notnullcols(tablename) AS (
	WITH t1 AS (
			UNPIVOT( FROM query_table(tablename) SELECT max( columns(*)) ) 
			ON columns(*)::varchar
	) FROM t1 SELECT list(name) 
) ;

SET VARIABLE notnullcols = notnullcols(t);

FROM t SELECT columns(c -> list_contains(getvariable('notnullcols'), c)) ;

Copy code

MAUVIERE

Expand

Share link