DuckDB and Julia

DuckDB and Julia

July 8, 2024

Sometimes the CSV reader in Julia is not as performant or “robust” as the data.tables’ fread. I have found that the DuckDB CSV reader is a good option to have and easy enough to use; it also offers the option to filter the stream in case you are only interested in reading a subset of rows.

Plus, it taught me a few things on how to use DuckDB within Julia for some operations where it can be more efficient.

Reading Data #

First we load (or install) the required packages; I use HTTP to load some remote data.

# Load the packages
using DataFrames, DataFramesMeta
using DuckDB
using HTTP

Then we need to tell the database that it will read data directly from memory (as opposed to some server for example)

con = DBInterface.connect(DuckDB.DB, ":memory:")

Here we are again going to read the flights data (see my general post on DataFrames.jl)

# to read from https file we install DuckDB packages
DBInterface.execute(con, "INSTALL httpfs;") 
DBInterface.execute(con, "LOAD httpfs;") 

url_flights = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
query_str = """
    SELECT *
    FROM read_csv('$(url_flights)')
""";
@time dat = DBInterface.execute(con, query_str) |> DataFrame

If you wanted to read only the set of flights that occured in the month of April

query_str_where = """
    SELECT *
    FROM read_csv('$(url_flights)'),
    WHERE month == 4
""";
@time dat = DBInterface.execute(con, query_str_where) |> DataFrame

While there are no meaningful performance differences at this small scale, you can see how much smaller the allocations are in the second case.

Joins #

Since DuckDB mostly uses a SQL dialect, some joins are made relatively easy. Here is an example of a of “RANGE” join

# 
dat_merge = DataFrame(carrier = ["AA", "UA"], start_month = [1, 4], end_month = [3, 6]) 

# First we make sure the data is visible to DuckDB
DuckDB.register_data_frame(con, dat, "dat")
DuckDB.register_data_frame(con, dat_merge, "dat_merge")

# and the query is fairly self-explanatory
query_str = """
    SELECT *
    FROM dat d1, dat_merge d2
    WHERE d1.carrier = d2.carrier
        AND d1.month >= d2.start_month
        AND d1.month <= d2.end_month
""";
@time DBInterface.execute(con, query_str) |> DataFrame

Note that this is at this scale slightly more efficient than using FlexiJoins.jl as described here



Code
Julia