Page Body

What Is DuckDB?

If you are someone that works with data, you may have heard about DuckDB. Many people (1, 2, 3, 4, 5) are impressed with the versatility and usefulness of this tool, but when you first approach it, it may be hard to appreciate where it fits in with traditional database management systems (DBMSs).

Client/Server Versus Embedded

Most DBMSs you may be familiar with are based on a client/server model. The server process manages the database files, accepts connections to the database from client applications, and performs database actions on behalf of the clients. Examples of client/server DBMSs include:

In contrast, embedded DBMSs do not come as standalone applications, but are embedded in applications, often as an in-process library. Usually, they offer multiple storage modes (e.g., as a transient, in-memory database, a persistent, single file on-disk, or both). A popular example of an embedded DBMS is SQLite.

Transactional Versus Analytical

Transactional DBMSs (also known as Online Transaction Processing, OLTP, databases) are optimized to quickly read/write individual rows of data, while ensuring data integrity. They are designed to handle a high volume of transactions in real-time.

Typically, these systems are row-oriented, i.e., data is organized as a sequence of records, storing all values of a single row together. This allows for efficient read/write operations on a complete record, where transactions often involve inserting, updating, or deleting records.

The previously mentioned client/server DBMSs are also examples of transactional DBMSs.

Analytical DBMSs (also known as Online Analytical Processing, OLAP, databases) are designed to store and manage large volumes of data for analysis. They are optimized for quick query response times and advanced analytics.

Typically, these systems are column-oriented, i.e., each column of a table is stored separately, which means that reading a specific column's data can be quickly done without the need to process the rest of the rows' data. This is beneficial for queries that often involve aggregations and scans over large volumes of data.

ClickHouse is an example of a client/server, analytical DBMS.

So, What Is DuckDB?

DuckDB is an embedded, analytical DBMS that aims to be your data analysis and data engineering multi-tool. It is simple to install, has no external dependencies, and requires no server software to install, update, and maintain.

It runs on all major operating systems (GNU/Linux, macOS, Windows) and CPU architectures (x86, ARM), and can also run in web browsers using DuckDB-Wasm. It can be used as a transient, in-memory database or as a persistent, single file on-disk.

DuckDB provides extensive support for complex queries in SQL, as well as transactional guarantees (ACID properties) through their custom, bulk-optimized Multi-Version Concurrency Control (MVCC). Also, it is deeply integrated into Python and R for efficient, interactive data analysis.

To support OLAP workloads, DuckDB uses a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a vector) are processed in one operation. This greatly reduces the overhead present in traditional systems (e.g., PostgreSQL, MySQL, SQLite), where rows are sequentially processed.

DuckDB was originally developed by Mark Raasveldt and Hannes Mühleisen at the Centrum Wiskunde & Informatica (CWI) in the Netherlands (the same national research institute where Guido van Rossum created the Python programming language). They see it as their responsibility to society to make the results of their work freely available to all.

DuckDB is tested using continuous intergration (CI) using a test suite that contains millions of queries, and includes queries adapted from the test suites of SQLite, PostgreSQL, and MonetDB. In addition, various tests are run that stress DuckDB under heavy loads.

Like many FLOSS projects and scientific endeavors, DuckDB uses components from various FLOSS projects, draws inspiration from scientific publications, and acknowledges that it is standing on the shoulders of giants.

Conclusion

In a field overrun with the hyperbole of the vapid and corrupt (6, 7, 8, 9, 10, 11, 12), it is encouraging to find an empowering FLOSS tool created by (what appears to be) stable people with a deep knowledge of and passion for their specialty.

Enjoyed this post?

Subscribe to the feed for the latest updates.