Back to Code Snippets


Parse a File in an Unsupported Format

Editor's note: as data engineers, we're often burdened with data that's not in a standard format. Using DuckDB's basic string functions, advanced regex functions, list functions and the CSV parser, you can parse data of arbitrary formats.

Execute this SQL

/*
  This query pulls the contents of a file into a a single row/column, 
  and then parses it in the select statement. As an example, we're 
  parsing /proc/{pid}/environ files which are record separated by the 
  null char (\0) and column separated by the '=' char.
*/
SELECT
  str_split(filename, '/')[3] AS PID,   
  env_pair[1] AS Name,   
  env_pair[2] AS Value 
FROM (    
  SELECT
    filename,
    str_split(unnest(str_split(column0, chr(0))), '=') AS env_pair
    FROM read_csv(
           '/proc/[0-9]*/environ',
           header=False,
           filename=true,
           -- Make sure your 'delim' is a string not in the file
           delim='\0',
           columns={column0: 'VARCHAR'})
)
WHERE Name != ''
ORDER BY PID, Name

Copy code

Mark Roddy

Expand

Share link