Welcome to the Learning DuckDB newsletter!
Thanks for signing up, even without knowing what to expect! I'll do my best to keep you updated on the latest news and developments in the DuckDB ecosystem, as well as share some interesting articles and resources that I come across.
Another focus of this newsletter will be sharing SQL tips & tricks that I've picked up in the past and found useful.
I'm aiming for at least one newsletter per month, but I'll try to publish more often if there's something interesting to share. I'm very interested in your feedback, so please reach out if you have any questions or suggestions!
Thanks in advance,
Tobi
About this newsletter
I'd like to share a little bit about how this newsletter will "work".
Categories
There will be four different categories in this newsletter:
- SQL Tips & Tricks: A list of SQL queries and their explanations that might be useful to you
- DuckDB News: News and updates about DuckDB itself (e.g. releases, blog posts, etc.)
- Community News: News and updates about the DuckDB community (e.g. new extensions, meetups, etc.)
- Articles & Resources: A list of articles and resources that I find interesting
Badges for SQL Tips & Tricks
I will use the following badges to indicate the difficulty level of the SQL tips & tricks:
Run on SQL Workbench
Each SQL query will have a badge that allows you to run it on SQL Workbench. It looks like this:
Just click on it, and it will open and run the SQL query directly in your browser!
What is DuckDB?
DuckDB is a powerful and fast in-process analytical database system. It stands out for several key reasons:
Simple and Portable
- Zero external dependencies
- Runs in-process in its host application
- Available on all major platforms (Linux, macOS, Windows)
- Supports all popular hardware architectures
- Offers client APIs for major programming languages
- Can even run in a browser via WASM
Feature-Rich SQL Engine
- Provides a comprehensive and user-friendly SQL dialect
- Can read and write multiple file formats, e.g.
- CSV
- Parquet
- JSON
- Apache Arrow
- Supports both the local file system and remote endpoints (like S3 buckets); see the sketch below
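As a small illustration of the format support, converting a CSV file into a Parquet file is a one-liner (the file names here are made up):
-- Read a CSV file and write it back out as Parquet (hypothetical file names)
COPY (SELECT * FROM 'input.csv') TO 'output.parquet' (FORMAT parquet);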
Performance-Focused
- Built with a columnar engine
- Supports parallel execution
- Can handle larger-than-memory workloads
- Optimized for analytical queries (see the example below)
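If you want to experiment with these properties yourself, here is a minimal sketch (both settings exist in DuckDB, the values are arbitrary examples):
-- Limit the number of threads used for parallel execution
SET threads = 4;
-- Cap memory usage; larger-than-memory workloads spill to disk
SET memory_limit = '4GB';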
Open and Extensible
- Open-source under the MIT License
- Supports third-party extensions (see the example below) for:
- New data types
- Custom functions
- Additional file formats
- Extended SQL syntax
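Installing and loading an extension takes just two statements; the spatial extension serves as an example here:
-- Install the extension once, then load it into the current session
INSTALL spatial;
LOAD spatial;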
You can read more about DuckDB at Why DuckDB, and in the DuckDB documentation.
If you want to install DuckDB on your machine, you can do so by following the installation instructions.
SQL Tips & Tricks
You'll find a list of potentially useful SQL queries and their explanations below. Feel free to send me any questions or suggestions via [email protected]!
Loading a remote Parquet file
Let's read a remote Parquet file from GitHub, in this example containing a list of AWS services:
SELECT
*
FROM
'https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_services.parquet';
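Reading remote files over HTTP(S) is handled by the httpfs extension. Recent DuckDB versions usually autoload it on demand; if yours doesn't, you can load it explicitly:
-- Only needed if your DuckDB version doesn't autoload httpfs
INSTALL httpfs;
LOAD httpfs;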
Loading a remote CSV file
Let's read a remote CSV file from GitHub, in this example containing a list of AWS CloudFront Edge Locations:
SELECT
*
FROM
'https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.csv';
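The CSV dialect is auto-detected here. Should the detection ever guess wrong, you can pass the options explicitly to read_csv; the values below are just for illustration:
SELECT
*
FROM
read_csv('https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.csv', header = true, delim = ',');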
Querying a JSON API
Let's query the Open Meteo API for a 3-day temperature forecast for Amsterdam:
-- Install and load the JSON extension
INSTALL json;
LOAD json;
SELECT
unnest(hourly.time)::TIMESTAMP AS local_time,
unnest(hourly.temperature_2m) AS temperature
FROM
read_json('https://api.open-meteo.com/v1/forecast?latitude=52.374&longitude=4.8897&hourly=temperature_2m&forecast_days=3', auto_detect=true);
We use the unnest() function to get row-like records from a nested JSON structure.
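As a minimal standalone example, unnest() turns a list value into one row per element:
-- Returns three rows: 1, 2, 3
SELECT unnest([1, 2, 3]) AS n;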
Prefixing columns with their table name
Oftentimes, when you have multiple tables with the same column names, you might want to prefix the columns with their table name to avoid ambiguity. You can do this using the COLUMNS expression:
SELECT
COLUMNS(t1.*) AS 't1_\0',
COLUMNS(t2.*) AS 't2_\0'
FROM range(10) t1
JOIN range(10) t2 ON t1.range = t2.range;
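Since \0 refers to the matched column name, the result contains the columns t1_range and t2_range instead of two ambiguous range columns.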
Getting a quick summary of a dataset
We will use the SUMMARIZE statement to get a very fast overview of a query result. In this case it's a list of public cloud provider IP address ranges:
SUMMARIZE SELECT
*
FROM
read_csv_auto('https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.csv');
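For each column, SUMMARIZE reports statistics such as the column type, minimum and maximum values, the approximate number of unique values, average, standard deviation, quartiles, and the percentage of NULL values.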
Using modern syntax for Top N by Group queries
The following examples are taken from the excellent blog post Fast Top N Aggregation and Filtering with DuckDB.
The query below returns the 3 most recent shipments for each supplier in an example dataset. It uses the QUALIFY clause, which acts like a WHERE clause, but specifically operates on the results of window functions:
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet'
SELECT
*,
row_number() OVER (PARTITION BY l_suppkey ORDER BY l_shipdate DESC) AS my_ranking
QUALIFY
my_ranking <= 3;
This can be simplified even further with the following syntax:
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet'
SELECT
l_suppkey,
max(l_shipdate, 3) AS top_3_shipdates
GROUP BY ALL;
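Here, max(l_shipdate, 3) uses the optional second argument of the aggregate function: instead of a single maximum, it returns a list of the 3 largest values per group.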
We can even get the top 3 values for every column automatically:
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet'
SELECT
max(COLUMNS(*), 3) AS "top_3_\0";
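This yields a single row in which each cell contains a list of the respective column's three largest values.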
DuckDB News
Below you'll find some news about DuckDB itself:
There's a new PR in the DuckDB repo by Laurens Kuiper that will add an in-memory external file cache. This will significantly speed up subsequent reads of the same data files (e.g. remote Parquet, JSON or CSV files). Really looking forward to that!
Another currently open PR from Pedro Holanda will eventually add read/write support for complex JSON types in Arrow data. It extends the Arrow type metadata mechanism to handle complex JSON types, and also ensures that deserialization does not fail when encountering non-JSON strings.
The new Node client library for DuckDB, duckdb-node-neo, has a new version: 1.2.0-alpha.15, containing multiple QoL improvements and bugfixes. Try it out if you're still using the old client! You can read more about it in the launch blog post.
Community News
Some relevant updates from the broader DuckDB community:
To add to the DeepSeek hype, they released smallpond, a lightweight data processing framework built on DuckDB and 3FS. They managed to sort 110.5TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66TiB/min, on a cluster comprising 50 compute nodes and 25 storage nodes running 3FS. Impressive stuff, but probably a bit of overkill for most DuckDB users.
There's an update to the excellent Google Sheets extension, and Archie Sarre-Wood and Alex Monahan wrote a great blog post about it. Check it out if you're working with both DuckDB and GSheets!
Hannes Mühleisen wrote a blog post about a new v1.2.0 feature: Prefix Aliases in SQL. This is a very nice syntax extension, with which you can alias a column name with a prefix and a colon, instead of using the AS postfix syntax:
SELECT
e: 1 + 2,
f: len('asdf'),
s: (SELECT 42);
compared to
SELECT
1 + 2 AS e,
len('asdf') AS f,
(SELECT 42) AS s;
Articles & Resources
Mehdi Ouazza wrote an excellent article titled DuckDB goes distributed? DeepSeek's smallpond takes on Big Data.
Tausendsassa ("jack of all trades", I love this word) Simon Späti published A Beginner's Guide to Geospatial with DuckDB, so if you're into geospatial data projects and want to learn how DuckDB and its spatial extension can help, this is a go-to resource!
Fin
Thanks for reading the first edition of the Learning DuckDB newsletter! Feel free to send me feedback if you have any!
Have a great day!
Tobi