Working with spatial data

Editor's note: Geospatial data is increasingly important for analytics, whether you're looking at data like store inventory, customer locations, or the weather. The spatial extension for DuckDB provides support for common data formats, spatial calculations, and searching within geometries.

Create a point from latitude and longitude pairs

-- Install spatial extension
INSTALL spatial; 
LOAD spatial; 

-- Represent a latitude and longitude as a point
-- The Eiffel Tower in Paris, France has a 
-- latitude of 48.858935 and longitude of 2.293412
-- We can represent this location as a point
SELECT st_point(48.858935, 2.293412) AS Eiffel_Tower;

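To check what was actually stored, the geometry can be rendered as well-known text. A quick sketch, assuming the spatial extension is already loaded:

-- Render the point as well-known text (WKT) to inspect it
SELECT st_astext(st_point(48.858935, 2.293412)) AS wkt;
-- POINT (48.858935 2.293412)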

Find the distance between two locations (in meters)

-- Distance between the Eiffel Tower and the Arc de Triomphe in Paris
-- Using the EPSG spatial reference systems: 
-- EPSG:4326 geographic coordinates as latitude and longitude pairs
-- EPSG:27563 projection that covers northern France and uses meters 

SELECT
    st_point(48.858935, 2.293412) AS Eiffel_Tower, 
    st_point(48.873407, 2.295471) AS Arc_de_Triomphe,
    st_distance(
        st_transform(Eiffel_Tower, 'EPSG:4326', 'EPSG:27563'), 
        st_transform(Arc_de_Triomphe, 'EPSG:4326', 'EPSG:27563')
    ) AS Aerial_Distance_M;

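The projected distance is returned in meters, so unit conversion is plain arithmetic. A small sketch reusing the same two points to report kilometers:

-- Same aerial distance, reported in kilometers
SELECT round(
    st_distance(
        st_transform(st_point(48.858935, 2.293412), 'EPSG:4326', 'EPSG:27563'),
        st_transform(st_point(48.873407, 2.295471), 'EPSG:4326', 'EPSG:27563')
    ) / 1000, 2) AS Aerial_Distance_KM;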

Find the country for a given latitude and longitude location

-- Load the geometry outline for each country
-- Save the country name and "geom" border in table world_boundaries
CREATE OR REPLACE TABLE world_boundaries
AS
SELECT *
FROM st_read('https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/world-administrative-boundaries/exports/geojson');

-- Find the enclosing country for a given point
-- We can check which country the Eiffel Tower is in
-- Note: the GeoJSON geometries use longitude/latitude order,
-- so the point is built as st_point(longitude, latitude)
SELECT name, region
FROM world_boundaries
WHERE ST_Within(st_point(2.293412, 48.858935), geom);

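The same ST_Within predicate also works as a join condition, so several points can be classified in one pass. A minimal sketch against the world_boundaries table loaded above (the landmark list and coordinates are illustrative):

-- Look up the country for several points at once
SELECT p.landmark, wb.name AS country, wb.region
FROM (VALUES
        ('Eiffel Tower', 2.293412, 48.858935),
        ('Statue of Liberty', -74.044502, 40.689247)
     ) AS p(landmark, lon, lat)
JOIN world_boundaries AS wb
  ON ST_Within(st_point(p.lon, p.lat), wb.geom);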

Simon Aubury


Loading remote parquet files

Editor's note: DuckDB supports lambda functions for easy transformation. It also supports reading data over HTTP/HTTPS, in addition to blob stores. Simon's snippet shows how to use both of those capabilities to dynamically generate filenames and query them when globbing isn't supported.

Load a single parquet file

-- Query a single remote parquet file 
-- Query the January data from NY taxi data set
SELECT tpep_pickup_datetime, total_amount
FROM read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet')
LIMIT 10;

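read_parquet also accepts a list of files, so a couple of months can be combined without any dynamic filename generation. A small sketch using the same dataset:

-- Query two months by passing an explicit list of remote files
SELECT count(*) AS trips
FROM read_parquet([
    'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet',
    'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet'
]);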

Dynamically iterate over remote parquet files with a lambda function and list_transform

-- Dynamically create the expected filenames to load
-- e.g., the file names `2023-01.parquet` to `2023-06.parquet`
-- A lambda function can use the result of generate_series(1, 6),
-- and the format function can use the `{:02d}` format specifier
-- to left-pad the month numbers with 0's

SELECT filename,
    min(tpep_pickup_datetime),
    max(tpep_pickup_datetime),
    avg(total_amount)
FROM read_parquet(
    list_transform(
        generate_series(1, 6),
        n -> format('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-{:02d}.parquet', n) 
    ), filename=true
)
GROUP BY filename
ORDER BY filename;

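To sanity-check the generated URLs before downloading anything, the list_transform call can be run on its own. A quick sketch:

-- Preview the generated file names without reading any data
SELECT list_transform(
    generate_series(1, 6),
    n -> format('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-{:02d}.parquet', n)
) AS urls;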

Simon Aubury


Working with public REST APIs

Editor's note: By combining DuckDB's read_json capability with its ability to retrieve arbitrary HTTPS URLs, you can easily use DuckDB to read data from public, unprotected APIs. Note that you can't currently pass arbitrary HTTP headers to access protected APIs.

Query JSON from a public REST API

-- DuckDB can directly query a public REST API returning a JSON 
-- Query the "tvmaze.com" API for the TV show "South Park"

SELECT name, type, summary
FROM read_json('https://api.tvmaze.com/singlesearch/shows?q=South%20Park', 
    auto_detect=true, format='newline_delimited');

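The payload also contains nested fields that read_json detects automatically. A sketch that assumes the API response includes a genres list and a rating object with an average field:

-- Drill into nested fields of the same response
-- (genres and rating.average are assumed to exist in the payload)
SELECT name, genres, rating.average AS rating
FROM read_json('https://api.tvmaze.com/singlesearch/shows?q=South%20Park',
    auto_detect=true, format='newline_delimited');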

Format the response from a public REST API

-- Format the response from a public REST API 
-- Retrieve a list of temperature readings for Sydney, Australia 
-- The returned payload has hourly temps for 24 hours
-- We can use JSONPath to extract data from a JSON document
-- to find the most recent temperature

SELECT json_extract(hourly, '$.temperature_2m[1]')
FROM read_json('https://api.open-meteo.com/v1/forecast?latitude=-33.8678&longitude=151.2073&hourly=temperature_2m&forecast_days=1',
    auto_detect=true, format='newline_delimited');

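The same JSONPath approach can pull the matching timestamp alongside the temperature, since the hourly payload carries a parallel time array. A sketch under that assumption:

-- Pair the hourly timestamp with its temperature reading
SELECT
    json_extract(hourly, '$.time[1]') AS hour,
    json_extract(hourly, '$.temperature_2m[1]') AS temp_c
FROM read_json('https://api.open-meteo.com/v1/forecast?latitude=-33.8678&longitude=151.2073&hourly=temperature_2m&forecast_days=1',
    auto_detect=true, format='newline_delimited');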

Simon Aubury


Discretize a numeric column following specified thresholds

discretize() converts a numeric column into discrete, ordered ids based on a list of thresholds. Values below the smallest threshold map to 1, each threshold crossed increments the id, and NULL inputs stay NULL.

Execute this SQL

CREATE OR REPLACE MACRO discretize(v, l) AS (
	WITH t1 AS (
		SELECT unnest(list_distinct(l)) AS j
	), t2 AS (
		SELECT COUNT(*) + 1 AS c FROM t1
		WHERE try_cast(j AS float) <= v
	)
	FROM t2
	SELECT IF(v IS NULL, NULL, c)
);

--Usage
FROM 'https://raw.githubusercontent.com/thewiremonkey/factbook.csv/master/data/c2127.csv'
SELECT name, value, discretize(value, [2,3,4,5]) AS class ;

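A quick way to see which class each value falls into is to run the macro over a few literals. A small sketch with arbitrary values:

-- Sanity-check the macro on literal values
SELECT v, discretize(v, [2, 3, 4, 5]) AS class
FROM (VALUES (1.5), (2.0), (3.7), (6.0), (NULL)) AS t(v);
-- 1.5 -> 1, 2.0 -> 2, 3.7 -> 3, 6.0 -> 5, NULL -> NULL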

Éric Mauvière


Creating parameterized views with table-valued function (TVF) macros

Execute this SQL

-- create macro
CREATE OR REPLACE MACRO udf_products_in_year (v_year, v_category)
AS TABLE
SELECT 
	name, 
	category, 
	created_at
FROM products 
WHERE category = v_category
AND year(created_at) = v_year;

-- select using the macro as you would do from a table
SELECT *
FROM udf_products_in_year (2020, 'Home and Garden');

-- | name          | category        | created_at              |
-- | Copper Light  | Home and Garden | 2020-04-05 00:00:00.000  |
-- | Pink Armchair | Home and Garden | 2020-06-23 00:00:00.000  |

-- input ddl and data
CREATE TABLE products 
(
	name varchar,
	category varchar,
	created_at timestamp
);

INSERT INTO products
VALUES
('Cream Sofa', 'Home and Garden', '2019-03-14'),
('Copper Light', 'Home and Garden', '2020-04-05'),
('Pink Armchair', 'Home and Garden', '2020-06-23');

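Because the macro behaves like a parameterized view, it composes with ordinary SQL. A small sketch counting rows for a different year:

-- The table macro can be used anywhere a table is expected
SELECT category, count(*) AS n
FROM udf_products_in_year(2019, 'Home and Garden')
GROUP BY category;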

Octavian Zarzu


Remove duplicates

Execute this SQL

/* removes duplicate rows at the order_id level */
SELECT * FROM orders
QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY created_at) = 1;

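To keep the most recent row per order instead of the earliest, flip the sort direction. A sketch assuming the same orders table:

/* keeps the latest row per order_id */
SELECT * FROM orders
QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY created_at DESC) = 1;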

Octavian Zarzu
