Welcome to the Learning DuckDB newsletter!
Thanks for signing up, even without knowing what to expect! I'll do my best to keep you updated on the latest news and developments in the DuckDB ecosystem, as well as share some interesting articles and resources that I come across.
Another focus of this newsletter will be sharing SQL tips & tricks that I've picked up in the past and found useful.
I'm aiming for at least one newsletter per month, but I'll try to publish more often if there's something interesting to share. I'm very interested in your feedback, so please reach out if you have any questions or suggestions!
Thanks in advance,
Tobi
About this newsletter
I'd like to share a little bit about how this newsletter will "work".
Categories
There will be four different categories in this newsletter:
- SQL Tips & Tricks: A list of SQL queries and their explanations that might be useful to you
- DuckDB News: News and updates about DuckDB itself (e.g. releases, blog posts, etc.)
- Community News: News and updates about the DuckDB community (e.g. new extensions, meetups, etc.)
- Articles & Resources: A list of articles and resources that I find interesting
Badges for SQL Tips & Tricks
I will use the following badges to indicate the difficulty level of the SQL tips & tricks:
Run on SQL Workbench
Each SQL query will have a badge that allows you to run it on SQL Workbench. It looks like this:
Just click on it, and it will open and run the SQL query directly in your browser!
What is DuckDB?
DuckDB is a powerful and fast in-process analytical database system. It stands out for several key reasons:
Simple and Portable
- Zero external dependencies
- Runs in-process in its host application
- Available on all major platforms (Linux, macOS, Windows)
- Supports all popular hardware architectures
- Offers client APIs for major programming languages
- Can even run in a browser via WASM
Feature-Rich SQL Engine
- Provides a comprehensive and user-friendly SQL dialect
- Can read and write multiple file formats, e.g.
- CSV
- Parquet
- JSON
- Apache Arrow
- Supports both the local file system and remote endpoints (like S3 buckets); see the sketch below
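As a small illustration of the format support, converting a CSV file into a Parquet file is a one-liner (the file names here are made up):
-- Read a CSV file and write it back out as Parquet (hypothetical file names)
COPY (SELECT * FROM 'input.csv') TO 'output.parquet' (FORMAT parquet);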
Performance-Focused
- Built with a columnar engine
- Supports parallel execution
- Can handle larger-than-memory workloads
- Optimized for analytical queries (see the example below)
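If you want to experiment with these properties yourself, here is a minimal sketch (both settings exist in DuckDB, the values are arbitrary examples):
-- Limit the number of threads used for parallel execution
SET threads = 4;
-- Cap memory usage; larger-than-memory workloads spill to disk
SET memory_limit = '4GB';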
Open and Extensible
- Open-source under the MIT License
- Supports third-party extensions (see the example below) for:
- New data types
- Custom functions
- Additional file formats
- Extended SQL syntax
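Installing and loading an extension takes just two statements; the spatial extension serves as an example here:
-- Install the extension once, then load it into the current session
INSTALL spatial;
LOAD spatial;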
You can read more about DuckDB at Why DuckDB, and in the DuckDB documentation.
If you want to install DuckDB on your machine, you can do so by following the installation instructions.
SQL Tips & Tricks
You'll find a list of potentially useful SQL queries and their explanations below. Feel free to send me any questions or suggestions via [email protected]!
Loading a remote Parquet file
Let's read a remote Parquet file from GitHub, in this example containing a list of AWS services:
SELECT
*
FROM
'https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_services.parquet';
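Reading remote files over HTTP(S) is handled by the httpfs extension. Recent DuckDB versions usually autoload it on demand; if yours doesn't, you can load it explicitly:
-- Only needed if your DuckDB version doesn't autoload httpfs
INSTALL httpfs;
LOAD httpfs;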
Loading a remote CSV file
Let's read a remote CSV file from GitHub, in this example containing a list of AWS CloudFront Edge Locations:
SELECT
*
FROM
'https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.csv';
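The CSV dialect is auto-detected here. Should the detection ever guess wrong, you can pass the options explicitly to read_csv; the values below are just for illustration:
SELECT
*
FROM
read_csv('https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.csv', header = true, delim = ',');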
Querying a JSON API
Let's query the Open Meteo API for a 3-day temperature forecast for Amsterdam:
-- Install and load the JSON extension
INSTALL json;
LOAD json;
SELECT
unnest(hourly.time)::TIMESTAMP AS local_time,
unnest(hourly.temperature_2m) AS temperature
FROM
read_json('https://api.open-meteo.com/v1/forecast?latitude=52.374&longitude=4.8897&hourly=temperature_2m&forecast_days=3', auto_detect=true);
We use the unnest() function to get row-like records from a nested JSON structure.
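As a minimal standalone example, unnest() turns a list value into one row per element:
-- Returns three rows: 1, 2, 3
SELECT unnest([1, 2, 3]) AS n;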
Prefixing columns with their table name
Oftentimes, when you have multiple tables with the same column names, you might want to prefix the columns with their table name to avoid ambiguity. You can do this using the COLUMNS expression:
SELECT
COLUMNS(t1.*) AS 't1_\0',
COLUMNS(t2.*) AS 't2_\0'
FROM range(10) t1
JOIN range(10) t2 ON t1.range = t2.range;
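Since \0 refers to the matched column name, the result contains the columns t1_range and t2_range instead of two ambiguous range columns.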
Getting a quick summary of a dataset
We will use the SUMMARIZE statement to get a very fast overview of a query result. In this case it's a list of public cloud provider IP address ranges:
SUMMARIZE SELECT
*
FROM
read_csv_auto('https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.csv');
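For each column, SUMMARIZE reports statistics such as the column type, minimum and maximum values, the approximate number of unique values, average, standard deviation, quartiles, and the percentage of NULL values.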
Using modern syntax for Top N by Group queries
The following examples are taken from the excellent blog post Fast Top N Aggregation and Filtering with DuckDB.
The query below returns the 3 most recent shipments for each supplier in an example dataset. It uses the QUALIFY clause, which acts like a WHERE clause, but specifically operates on the results of window functions:
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet'
SELECT
*,
row_number() OVER (PARTITION BY l_suppkey ORDER BY l_shipdate DESC) AS my_ranking
QUALIFY
my_ranking <= 3;
This can be simplified even further with the following syntax:
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet'
SELECT
l_suppkey,
max(l_shipdate, 3) AS top_3_shipdates
GROUP BY ALL;
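Here, max(l_shipdate, 3) uses the optional second argument of the aggregate function: instead of a single maximum, it returns a list of the 3 largest values per group.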
We can even get the top 3 values for every column automatically:
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet'
SELECT
max(COLUMNS(*), 3) AS "top_3_\0";
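This yields a single row in which each cell contains a list of the respective column's three largest values.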
DuckDB News
Below you'll find some news about DuckDB itself:
There's a new PR in the DuckDB repo by Laurens Kuiper that will add an in-memory external file cache. This will significantly speed up subsequent reads of the same data files (e.g. remote Parquet, JSON or CSV files). Really looking forward to that!
Another currently open PR from Pedro Holanda will eventually add read/write support for complex JSON types in Arrow data. It extends the Arrow type metadata mechanism to handle complex JSON types, and also ensures that deserialization does not fail when encountering non-JSON strings.
The new Node client library for DuckDB, duckdb-node-neo, has a new version: 1.2.0-alpha.15, containing multiple QoL improvements and bugfixes. Try it out if you're still using the old client! You can read more about it in the launch blog post.
Community News
Some relevant updates from the broader DuckDB community:
To add to the DeepSeek hype, they released smallpond, a lightweight data processing framework built on DuckDB and 3FS. They managed to sort 110.5TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66TiB/min, on a cluster comprising 50 compute nodes and 25 storage nodes running 3FS. Impressive stuff, but probably a bit of overkill for most DuckDB users.
There's an update to the excellent Google Sheets extension, and Archie Sarre-Wood and Alex Monahan wrote a great blog post about it. Check it out if you're working with both DuckDB and GSheets!
Hannes Mühleisen wrote a blog post about a new v1.2.0 feature: Prefix Aliases in SQL. This is a very nice syntax extension, with which you can alias a column name with a prefix and a colon, instead of using the AS postfix syntax:
SELECT
e: 1 + 2,
f: len('asdf'),
s: (SELECT 42);
compared to
SELECT
1 + 2 AS e,
len('asdf') AS f,
(SELECT 42) AS s;
Articles & Resources
Mehdi Ouazza wrote an excellent article titled DuckDB goes distributed? DeepSeek's smallpond takes on Big Data.
Tausendsassa ("jack of all trades", I love this word) Simon Späti published A Beginner's Guide to Geospatial with DuckDB, so if you're into geospatial data projects and want to learn how DuckDB and its spatial extension can help, this is a go-to resource!
Fin
Thanks for reading the first edition of the Learning DuckDB newsletter! Feel free to send me feedback if you have any!
Have a great day!
Tobi