From R data.table to Julia DataFrames.jl

January 7, 2024

Introduction #

This is meant as a resource for people trying to move between R and Julia who are using data.table on the one hand and DataFrames.jl on the other. I have gathered functions I tend to use frequently or that I find frustrating to look up when I need them. There are probably glaring omissions. Feel free to open an issue here.

Resources #

  • I will mostly follow the excellent stata2r as a template.
  • You can also have a look at some of the mappings in the DataFrames.jl documentation
  • I have found this guide of data.table by Andrew Brooks very useful.

A word of warning #

  • Missing Data. Dealing with missing data in Julia requires you to be explicit, as most functions will return a missing value if an array includes a missing element, and some functions will error (see the snippet just after this list). I will try to point out the equivalence with the R code, but be aware that the behavior may not always be strictly identical if you implement it. I show some of the examples on the Freedman dataset from the car package in R.
  • Need improvement.
    • A few specific sections could use some user input, as the Julia solutions were less than ideal: leads and lags; dates; complex merges.
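For instance, a reduction over a vector with a missing element returns missing unless you skip the missing values explicitly:

using Statistics

mean([1, 2, missing])               # missing: the missing value propagates
mean(skipmissing([1, 2, missing]))  # 1.5: skip missing values explicitly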

Issues #

  • This is a first draft and there might be a few mistakes throughout the document. I am not a professional programmer and I have probably picked up bad habits here and there.
  • Email me at [email protected]
  • File an issue on gitlab here if you want to start a discussion or have ideas of things to add/change
  • Last updated on 2024-09-09

Installation #

Installation can sometimes be challenging. This should not be an issue here: in Julia we will mostly use the base DataFrames package and the convenient macros in the DataFramesMeta package. To read in data we will use the CSV package, though Julia supports many other file formats (see FileIO.jl).
Installation for data.table can be tricky, especially if you want to benefit from the multicore features. I recommend that you look at the installation wiki for more details.
import Pkg
Pkg.add("DataFrames")
Pkg.add("DataFramesMeta")

# Load the packages
using DataFrames
using DataFramesMeta
install.packages("data.table")
# latest development version that has passed all tests:
data.table::update_dev_pkg()

# Load the package
library(data.table)

We will also use other packages to load auxiliary datasets, download data, etc. I use the pipe macro in Julia and the magrittr package in R to compose commands (though both packages have some amount of chaining built-in); a quick illustration follows the loading blocks below.

# Install the packages
Pkg.add("CSV")
Pkg.add("ShiftedArrays") # lead/lag operators
Pkg.add("HTTP") # download utilities
Pkg.add("RDatasets") 
Pkg.add("Pipe") # pipes
Pkg.add("FlexiJoins") # for complex merges (by conditions)
Pkg.add("IntervalSets") # easier to formulate complex merge conditions
# Install the packages
install.packages("car")       # RDataset
install.packages("magrittr")  # pipes
install.packages("statar")    # stata-style data utilities
install.packages("lubridate") # date utilities
# Load the packages
using HTTP
using CSV
using Dates
using FlexiJoins
using IntervalSets
using RDatasets
using Pipe
using ShiftedArrays


# We will need some statistical functions from the Statistics standard library
import Statistics: mean, median, quantile
# Load the packages
library(car)
library(magrittr)
library(statar) 
library(lubridate)
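Since the examples below lean on @pipe, here is a quick illustration of its placeholder syntax: the underscore receives the result of the previous step.

# `_` is replaced by the result of the previous step
@pipe [3, 1, 2] |> sort(_) |> first(_, 2) # returns [1, 2]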

File I/O #

Reading Data #

I am using the flights dataset to run most of the examples. This assumes your data is in csv format.

There are options in both languages to read from more efficient formats like parquet. I have found that csv is both fast and idiot-proof, and it lends itself to quick manipulation in the shell if I need to do something quickly.

In julia, the CSV.jl package does not allow reading directly from a URL, but we can use the HTTP.jl package to download the file.
In R, you can read the dataset directly from a URL using fread in data.table.
# Flights data
url_flights = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
url_get = HTTP.get(url_flights);
dat = CSV.read(url_get.body, DataFrame)

# If you are reading a local compressed file, it is expanded to /tmp by default
# upon reading (see this [issue](https://github.com/JuliaData/CSV.jl/issues/988)).
# If writes to /tmp are limited (e.g. on an HPC cluster), this may prevent you from reading the file.
# Instead, use the `buffer_in_memory=true` option:
dat = CSV.read("file.csv.gz", DataFrame, buffer_in_memory=true)

# Freedman data
dat_missing = RDatasets.dataset("car", "Freedman")
# Flights data
url_flights = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"

dat = fread(url_flights)

# In fread you can choose which temp folder intermediary files are written to using `tmpdir`
# Note: this is different from the option offered in `julia`
fread("file.csv.gz", tmpdir="~/tmp")

# Freedman data
dat_missing = data.table(Freedman) # load and convert to data.table

Writing Data #

To write the file we use a similar function. It is also possible to save the file with compression
Similarly data.table provides a write function with optional compression
CSV.write("./data.csv", dat)
CSV.write("./data.csv.gz", dat; compress=true)

If you want to use a different compression algorithm (faster or more efficient), use TranscodingStreams.jl with the appropriate codec. For example, to use Zstandard (zst):

Pkg.add("CodecZstd")
using CodecZstd
open(ZstdCompressorStream, "./data.csv.zst", "w") do stream
    CSV.write(stream, dat)
end

# similarly to read it back
dat = open(ZstdDecompressorStream, "./data.csv.zst", "r") do stream
    CSV.read(stream, DataFrame)
end
fwrite(dat, "./data.csv")
fwrite(dat, "./data.csv.gz", compress="gzip") # with compression

(Benchmarks) #

I would like to include a benchmark for one large file and compare csv intake to parquet.
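In the meantime, here is a minimal sketch of how the csv side of such a benchmark could look (large_file.csv is a hypothetical path). CSV.jl parses files in parallel by default; the `ntasks` keyword controls the number of parsing tasks.

using CSV, DataFrames

# the first call includes compilation time, so time a second call
@time dat = CSV.read("large_file.csv", DataFrame)
@time dat = CSV.read("large_file.csv", DataFrame)
# single-threaded baseline
@time dat = CSV.read("large_file.csv", DataFrame; ntasks=1)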

Inspecting the dataset #

This is before we start filtering, collapsing, or merging the data.

Sorting #

It is important to be able to sort rows. What are the most delayed flights? What about the most delayed flight on a specific day?

Note that the code changes the order “in-place” as it changes the dataset itself.

The most basic task is to sort some columns with respect to a specific variable
In data.table, setorder sorts the rows of the dataset in place.
sort!(dat, :air_time)
sort!(dat, [:air_time, :dest])
sort!(dat, :air_time; rev=true)
sort!(dat, [order(:air_time, rev=true), :dest])

# if you do not want to change the dataset in place
sort(dat, :air_time)
setorder(dat, air_time) 
setorder(dat, air_time, dest) 
setorder(dat, -air_time)
setorder(dat, -air_time, dest)

# if you do not want to change the dataset in place
dat[ order(air_time)] # etc.

# To reorder a dataset programmatically use the `setorderv` function
col = "air_time"
setorderv(dat, col)

If we want to reorder columns (rather than rows)

select!(dat, [:month, :day], Not([:month, :day]))
setcolorder(dat, c("month", "day"))

Renaming #

Renaming modifies the dataset but does not alter its data, so we include it here.

This is where I start leaning on the macros from DataFramesMeta.jl
In data.table, use setnames, which modifies the names in place.
@rename!(dat, :new_arr_delay = :arr_delay)
@rename!(dat, :new_carrier = :carrier, :new_origin  = $"origin") 
@rename(dat, :new_arr_delay = :arr_delay) # not in place

rename!(x -> replace(x, "arr_" => "arrival_"), dat) # use the base DataFrames.jl function here
setnames(dat, "new_arr_delay", "arrival_delay") 
setnames(dat, c("carrier","origin"), c("new_carrier","new_origin")) 


setnames(dat, gsub("arr_", "arrival_", names(dat)))

Summary Statistics #

describe(dat)
describe(dat, :arr_delay)
describe(dat, :min, :detailed, cols=:arr_delay) 
sum_up(dat) # from statar package
sum_up(dat, arr_delay)
sum_up(dat, arr_delay, d = TRUE)

Tabulations #

summary_var = :carrier # or [:carrier, :origin]
@pipe dat |> groupby(_, summary_var) |>
    combine(_, nrow => :Freq, proprow => :Percent) |>
    @transform(_, :Cum = cumsum(:Percent))
tab(dat, carrier) # from statar package
tab(dat, carrier, origin)
# data.table version
dat[, .N, by = .(carrier, origin)][,
    `:=`(Percent=100*N/nrow(dat), Cum=100*cumsum(N)/nrow(dat)) ][]

Filtering #

Subsetting rows #

On the flights data #

There are multiple options for filtering; this is where I start leaning on the macros from DataFramesMeta.jl
In data.table filtering is not in place and you will need to assign the dataset (to itself) for the changes to propagate.
dat[1:200, :]
subset(dat, :day => x -> (x .> 5) .& (x .< 10)) # from dataframes base
@subset(dat, (:day .> 5) .& (:day .< 10)) # note the `.` for broadcasting
@rsubset(dat, (:day > 5) & (:day < 10))  # note the `r` for change by row
@rsubset!(dat, (:day > 5) & (:day < 10)) # change in place; parentheses needed since `&` binds tighter than `<`
@rsubset(dat, :origin == "LGA")
@rsubset(dat, occursin(r"^LG", :origin))
@rsubset(dat, :month ∈ [3, 4, 11, 12])
@rsubset(dat, :origin ∈ ["JFK", "LGA"])
@rsubset(dat, :month != 1)
dat[1:200] 
dat[day > 5 & day < 10]        # filtering only


dat = dat[day > 5 & day < 10]  # filtering and reassigning
dat[origin=="LGA"]
dat[origin %like% "^LG"] 
dat[month %in% c(3,4,11,12)] 
dat[origin %chin% c("JFK","LGA")] # %chin% is a fast %in% for characters 
dat[month!=1]

On the Freedman data (with missing values) #

In this case we need to be careful about how we deal with missing data. Note that to find the missing values, Julia uses the isequal function (because missing == missing returns missing).
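A quick illustration of the comparison semantics:

missing == missing         # missing (three-valued logic), not true
isequal(missing, missing)  # true
ismissing(missing)         # true, the idiomatic single-value test
1 < missing                # missing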

subset(dat_missing, :Population => x -> x .< 1000; skipmissing=true)
@rsubset(dat_missing, :Population .< 1000) # treats missing values as false by default
@rsubset(dat_missing, isequal(:Population, missing) )
dat_missing[ population < 1000 ]

dat_missing[is.na(population)]

Dropping duplicate or missing values #

Drop duplicate values #

unique(dat; keep=:first)  # default
unique!(dat; keep=:first) # in place
unique(dat; keep=:noduplicates) # :last also an option
unique(dat, [:month, :day, :carrier])
unique(dat) 
dat = unique(dat) 

unique(dat, by = c("month", "day", "carrier"))

Drop missing values #

dropmissing(dat_missing)
dropmissing!(dat_missing) # in place
dropmissing(dat_missing, :Population)
dropmissing(dat_missing, [:Population, :Density])

# if the column types still allow missing values, convert the arrays to non-missing types
disallowmissing(dat_missing)
disallowmissing(dat_missing, :Density)
na.omit(dat_missing) 
dat_missing = na.omit(dat_missing) 

dat_missing[!is.na(population)]
dat_missing[!is.na(population) & !is.na(density)]

Selecting columns #

# select columns
select(dat, :month, :day, :carrier)
select!(dat, :month, :day, :carrier)   # in place
select(dat, "month", "day", "carrier") # also works
select(dat, r"_delay")
select(dat, .!(eltype.(eachcol(dat)) .<: AbstractString) )
select(dat_missing, (eltype.(eachcol(dat_missing)) .<: Union{Missing, Int}) ) # if some columns include missing

# removing select columns
select(dat, Not([:origin, :dest]))
select!(dat, Not([:origin, :dest])) # in place
# select columns
dat[, .(month, day, carrier)] 
dat = dat[, .(month, day, carrier)] # "in place"
dat[, c("month", "day", "carrier")] # same but programmatic
dat[, .SD, .SDcols=patterns("_delay$")] # keep columns by matching
dat[, .SD, .SDcols=!is.character]       # keep columns by type


# removing select columns
dat[, -c("origin", "dest")]
dat[, c("origin", "dest") := NULL] # same, but in-place 

Rows and columns #

@select(@rsubset(dat, :origin=="LGA"), 
    :month, :day, :carrier)
@pipe dat |> @rsubset(_, :origin=="LGA") |> 
    @select(_, :month, :day, :carrier)
dat[origin=="LGA", .(month, day, carrier)]

Creating/modifying variables #

Basic operations #

@transform!(dat, :tot_delay = :dep_delay + :arr_delay)
@rtransform!(dat, :segment = :origin * :dest) # rowwise operation

# Programmatic version
x = "dep_delay"; y = "arr_delay"; z = "tot_delay";
@transform!(dat, $z = $x + $y)

# Conditional modification
dat[dat.month.==9, :distance] = dat[dat.month.==9, :distance] .+ 1;
dat[1:2, :air_time] .= 0;
# Using views (see https://bkamins.github.io/julialang/2024/05/10/jda.html)
@chain dat begin
    @rsubset(:month == 1; view=true)
    @rtransform!(:distance = :distance - 1)
    parent # to recover the source dat
end
# Or with missing values
dat_missing[isequal.(dat_missing.Population, missing), :City] .= "NOPOP";
# Using ternary operator
@rtransform!(dat_missing, :City = ismissing(:Population) ? "NOPOP" : :City)
dat[, tot_delay := dep_delay + arr_delay] 
dat[, segment := paste0(origin, dest) ]

# Programmatic version
x = "dep_delay"; y = "arr_delay"; z = "tot_delay"
dat[, c(z) := get(x) + get(y) ]

# Conditional modification
dat[month==9, distance := distance + 1]
dat[1:2, origin := "OBS"]
# Or with missing values
dat_missing[is.na(population), City := "NOPOP"]


Grouped operations #

@pipe dat |> groupby(_, :carrier) |> 
    @transform!(_, :avg_arr_delay = mean(:arr_delay)) # in place
@pipe dat |> groupby(_, :carrier) |> 
    @combine(_, :avg_arr_delay = mean(:arr_delay))
dat[, avg_arr_delay := mean(arr_delay), by=carrier] 

dat[, .(avg_arr_delay = mean(arr_delay, na.rm=T)), by=carrier]  # collapse see aggregation section below

Leads and lags #

Standard leads and lags #

I work with shifts on dates here, but the dataset is a full panel with consecutive dates, so there is nothing special about the dates variable per se.

It is easier to see this on a smaller dataset. So I aggregate the data to get the total monthly flights out of each origin airport (see last section).

For leads and lags on arrays, I use the ShiftedArrays package which works fine for standard operations (read: as long as you are not dealing with dates).
For standard leads and lags, it is faster to use the built-in shift function from data.table.
dat_shift = combine(
    groupby(dat, [:origin, :month]), 
    nrow => :N)
sort!(dat_shift, [:origin, :month])    

@transform!(groupby(dat_shift, :origin), 
    :growth = :N ./ ShiftedArrays.lag(:N, 1))
@transform!(groupby(dat_shift, :origin), 
    :growth_since_first = :N ./ :N[1] )

# The following is probably not optimal; here are two versions
@pipe dat_shift |> 
    groupby(_, :origin) |> @subset(_, 5 ∈ :month) |>  # this errors if month 5 is missing in a group
    groupby(_, :origin) |> @transform(_, :growth_since_may = :N ./ :N[:month.==5])
for subdf in groupby(dat_shift, :origin)    
    if 5 ∈ subdf.month
        @transform!(subdf, :growth_since_may = :N ./ :N[:month.==5])
    end
end
dat_shift = dat[, .N, by = .(origin, month)]
setorder(dat_shift, origin, month)



dat_shift[, growth := N/shift(N, 1), by = origin]

dat_shift[, growth_since_first := N/N[1], by = origin] 

dat_shift[, growth_since_may := N/N[month==5], by = origin]
dat_shift[, growth_since_may := .SD[["N"]]/.SD[month==5][["N"]], 
    .SDcols = c("N", "month"), by = origin]

The case of dates #

Dates are messy (imho). Lagging a variable by three months in a monthly panel does not necessarily translate into shifting the data by 3 indices (if the panel is unbalanced, for example). A correct date-aware lag should check that “shifting” back by three months from April lands on January, and not on December (which an index shift would hit if January were missing).
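A quick check of calendar-aware arithmetic with the Dates standard library:

using Dates

Date(2024, 4, 15) - Month(3) # 2024-01-15: a calendar-aware shift
# an index shift of 3 rows only coincides with this when the panel has no gaps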

@rtransform!(dat, :date = Date(:year, :month, :day) )


# Including time
@rtransform!(dat, :date_time = DateTime(:year, :month, :day, :hour) )

@rtransform!(dat, :date_y = year(:date))
@rtransform!(dat, :f7d_date = :date + Dates.Day(7))
@rtransform!(dat, :l3m_date = :date - Dates.Month(3))
# Make a date variable using data.table built-in IDate
dat[, date := as.IDate(paste(year, month, day, sep='-'))] 
# It is usually faster to use lubridate parser 
dat[, date := parse_date_time2(paste(year, month, day, sep='-'), "Y-m-d")]
dat[, date_time := parse_date_time2(paste(year, month, day, hour, sep='-'), "Y-m-d-H")]

dat[, date_y := year(date)] # extract year
dat[, f7d_date := date + days(7) ]   # date in 7 days
dat[, l3m_date := date - months(3) ] # date three months ago

Once we know how to lag dates, we would like to answer questions such as: what was the average flight delay for each origin airport three months ago compared to today? We will work with the aggregate delays by origins.

In julia, the ShiftedArrays package does not support dates (see this post on discourse and this issue).

This is one of the most annoying things when working with dates and panel data in julia. I have found that PanelShift.jl solves the problem, but it is still at version 0.1.1 and it is unclear how actively it is maintained. What is nice with julia is that you can simply loop over the data and do exactly what you want to do.

In R, I use the utilities tlag and tlead from statar, which lag based on date intervals.
@rtransform!(dat, :date = Date(:year, :month, :day) )
dat_shift = @combine(groupby(dat, [:origin, :date]), :arr_delay = mean(:arr_delay) )

# I could not find a built-in function, but julia is amenable to loops
# initialize the new column with missing values
dat_shift.l3m_arr_delay = Vector{Union{Missing,Float64}}(missing, nrow(dat_shift));
for subdf in groupby(dat_shift, :origin)
    for date_iter in subdf.date
        idx = isequal.(subdf.date, date_iter - Dates.Month(3))
        if sum(idx) == 1
            subdf[subdf.date .== date_iter, :l3m_arr_delay] .= subdf[idx, :arr_delay]
        end
    end
end
# sort(dat_shift, [:origin, :date])

# using PanelShift
@transform!(groupby(dat_shift, :origin),
    :l3m_arr_delay = tlag(:date, :arr_delay, Month(3) ) )
panellag!(dat_shift, :origin, :date, 
    :arr_delay, :l3m_arr_delay, Month(3))
dat_shift = dat[, .(arr_delay = mean(arr_delay, na.rm=T)), 
    by = .(origin, date=parse_date_time2(paste(year, month, day, sep="-"), "Y-m-d"))]

dat_shift[, l3m_arr_delay := tlag(arr_delay, n=months(3), time=date), by = .(origin) ]
# setorder(dat_shift, origin, date)

Advanced examples #

Applying functions to multiple variables #

  • Loops: if you have to use loops for convenience, data.table provides set, which allows changing values without the overhead of the `[.data.table` syntax. The syntax is of the form set(dat, i, j, value), where i is the row index and j the column index (or name).
# Loops
for col in (:tot_delay, :dep_delay) 
    dat[1:10, col] = - dat[1:10, col]
end

# Control flows 
@rtransform!(dat, :arr_delay_penalty = ifelse(:arr_delay>12, 1, -1) )
@rtransform!(dat, :arr_delay_penalty = ifelse( # no fcase equivalent in julia
    :arr_delay>=15, 3, 
    ifelse(:arr_delay>=6, 2, 
    ifelse(:arr_delay>=0, 1, 0) ) ) )     

# Modify multiple variables at the same time (same function)
cols = [:origin, :dest]
transform!(dat, cols .=> (x -> x .* "Airport") .=> cols)


# Apply multiple functions to one variable
list_fun = [mean, median, x->quantile(x, 0.25)]
transform(dat, :dep_delay .=> list_fun .=> [:delay_mean, :delay_median, :delay_q25])

# Apply multiple functions to multiple variables (automatic renaming)
cols = [:dep_delay :arr_delay]
res_cols = kron(["mean_", "median_", "q25_"], string.(cols)) 
transform(dat, cols .=> list_fun .=> res_cols)



# Function of multiple variables (by rows)
ratio_airtime(x::NamedTuple) = (x[1] /(x[1]+x[2]))
@rtransform! dat :frac_air = ratio_airtime(AsTable([:air_time, :tot_delay]))
ratio_airtime(x, y) = (x /(x+y))
transform(dat, [:air_time, :tot_delay] => ByRow((ratio_airtime)) => :frac_air)
# Loops
for (j in c("tot_delay", "dep_delay")){ # (faster) loops 
    set(dat, 1:10, j=j, value=-dat[1:10][[j]])
}

# Control flows
dat[, arr_delay_penalty := fifelse(arr_delay>12, 1, -1) ]
dat[, arr_delay_penalty := fcase(arr_delay>=15, 3, 
                                 arr_delay>=6,  2,
                                 arr_delay>=0,  1,
                                 default = 0) ]

# Modify multiple variables at the same time
cols = c("origin", "dest")
dat[, (cols) := lapply(.SD, \(x) paste(x,"Airport")), 
    .SDcols = cols]   

# Apply multiple functions to one variable
list_fun = function(x) list(mean(x), median(x), quantile(x, 0.25))
dat[, c("delay_mean", "delay_median", "delay_q25") := sapply(.SD, list_fun), 
    .SDcols = c("dep_delay")]

# Apply multiple functions to multiple variables
dat[, as.list(unlist(lapply(.SD, list_fun))), .SDcols = c("dep_delay", "arr_delay") ] #<1>
# other method
melt(dat, measure.vars=c("dep_delay", "arr_delay") )[
    , sapply(.SD, list_fun), .SDcols = c("value"), by = .(variable)]

# Function of multiple variables
dat[, tot_delay := rowSums(.SD), .SDcols=patterns('_delay$')]
ratio_airtime = function(air, tot)  (air /(air+tot))
# dat[, frac_air := ratio_airtime(air_time, tot_delay) ]
dat[, frac_air := ratio_airtime(air_time, tot_delay), by=.I] # row-wise operation
  1. See caveat here

Applying complex functions #

# Regressions by groups
# using FixedEffectModels
for subdf in groupby(dat, [:year, :month])
    reg_res = reg(subdf, @formula(tot_delay ~ air_time))
    # store the slope and its standard error as new columns on the parent dataset
    subdf[:, :β] .= coef(reg_res)[2]
    subdf[:, :σ] .= sqrt(vcov(reg_res)[2, 2])
end
select(dat, [:year, :month, :β, :σ]) |> unique
# Regressions by groups
dat_reg = dat[, .(
    {
        y = as.matrix(.SD[["tot_delay"]]) 
        x = as.matrix(cbind(1, .SD[["air_time"]]) )
        reg_res = lm.fit(x, y)
        b = coefficients(reg_res)[2]
        se = sqrt(sum(reg_res[["residuals"]]^2) / var(.SD[["air_time"]]) ) / .N
        c(b,se)
    }, 
    seq(1,2) ),
    by = .(year, month) ]
dcast(dat_reg, year + month ~ V2, value.var="V1") # anticipating on reshape section

Aggregating #

@combine(dat, :mean_dep_delay = mean(:dep_delay))

@combine(dat, :mean_dep_delay = mean(:dep_delay), :mean_arr_delay = mean(:arr_delay))
combine(dat, [:dep_delay, :arr_delay] .=> mean .=> [:mean_dep_delay, :mean_arr_delay])
@combine(dat, $AsTable = mean.([:dep_delay, :arr_delay]))
@combine(dat, $([:dep_delay, :arr_delay] .=> mean) )
# More complex but useful for quantiles
@combine(dat, $AsTable = NamedTuple( (^(:q25_dep_delay), ^(:q75_dep_delay)) .=> 
    (quantile(:dep_delay, [0.25, 0.75])) ) )
dat[, mean(dep_delay)] # returns a scalar
dat[, .(mean_ddel = mean(dep_delay))] # returns a data.table
dat[, .(mean_ddel=mean(dep_delay), mean_adel=mean(arr_delay))]
dat[, lapply(.SD, mean), .SDcols=c('arr_delay','dep_delay')] 


# More than one variable
dat[, as.list(quantile(.SD, c(.25, .75), na.rm = TRUE)), .SDcols="dep_delay" ]   
# and merge back ...

Reshaping #

Wide to long #

Julia uses stack for going from wide to long.
In data.table, we use the built-in tool melt for going from wide to long.
# It is easier if we give all the flights a unique identifier 
@transform!(dat, :uid = 1:nrow(dat))
stack(dat, [:dep_delay, :arr_delay] )
stack(dat, r"_delay")
dat_long = stack(dat, 
    [:dep_delay, :arr_delay], 
    [:uid, :carrier, :origin, :dest])
# It is easier if we give all the flights a unique identifier
dat[, uid := seq(1, .N) ]
melt(dat, measure=c("dep_delay", "arr_delay"))
melt(dat, measure=patterns('_delay'))
dat_long = melt(dat, 
    measure=c("dep_delay", "arr_delay"),
    id.vars=c("uid", "carrier", "origin", "dest"))

Long to wide #

Julia uses unstack for going from long to wide.
In data.table, we use the built-in tool dcast for going from long to wide.
# We start with the long data from above
dat_wide = unstack(dat_long)
# If you only want to keep the id & *_delay cols
unstack(dat_long, :uid, :variable, :value)
# We start with the long data from above
dat_wide = dcast(dat_long, ... ~ variable)
# If you only want to keep the id & *_delay cols
dcast(dat_long, uid ~ variable)

Merging #

Basic merge #

# Load second dataset
dat_airports = CSV.read(
    HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/nycflights13/airports.csv").body,
    DataFrame) 

# Inner join
innerjoin(dat, dat_airports, on=:dest => :faa)
# if the datasets share a common name for the merge variable
@rename!(dat_airports, :dest=:faa)
innerjoin(dat, dat_airports, on=:dest)
# leftjoin! performs an in-place join that updates the first dataframe argument
leftjoin!(dat, dat_airports, on=:dest)


# with missing values
innerjoin(dat, dat_airports, on=:dest; matchmissing=:error)

# Other types of merge
# Left join
leftjoin(dat, dat_airports, on=:dest)
leftjoin(dat, dat_airports, on=:dest, source="_merge") # stata style merge info
# if there are missing values in the merge columns
leftjoin(dat, dat_airports, on=:dest, matchmissing=:notequal) # 
# Right join
rightjoin(dat, dat_airports, on=:dest)
# Outer join
outerjoin(dat, dat_airports, on=:dest)
# Semi join (filtering)
semijoin(dat, dat_airports, on=:dest)
# Anti join
antijoin(dat, dat_airports, on=:dest)
# Cross join
crossjoin(unique(select(dat, :origin)), unique(select(dat, :dest)) )
# Load second dataset
dat_airports = fread(
    "https://vincentarelbundock.github.io/Rdatasets/csv/nycflights13/airports.csv") 


# Inner join (default)
merge(dat, dat_airports, by.x = c("dest"), by.y = c("faa"))
# if the datasets share a common name for the merge variable
setnames(dat_airports, c("faa"), c("dest"))
merge(dat, dat_airports, by = c("dest"))

# with missing values
merge(dat, dat_airports, by = c("dest"))

# Other types of merge
# Left join
merge(dat, dat_airports, by = c("dest"), all.x = TRUE)

# Right join
merge(dat, dat_airports, by = c("dest"), all.y = TRUE)
# Outer join
merge(dat, dat_airports, by = c("dest"), all.x = TRUE, all.y = TRUE)
# Semi join (filtering)
merge(dat, dat_airports[, .(dest)], by = c("dest"))
# Anti join
dat[fsetdiff(dat[, .(dest)], dat_airports[, .(dest)]), on = "dest"]
# Cross join
CJ(unique(dat[["origin"]]), unique(dat[["dest"]])) # all combination of origin and destinations

Advanced merging #

Non-equi joins #

dat3 = DataFrame(carrier = ["AA", "UA"],
    start_month = [1, 4], end_month = [3, 6]) 

# Conditional join that catches everything between the distinct
# start and end dates for each carrier.
innerjoin((dat, dat3),
    by_key(:carrier) & 
    by_pred(:month, ∈, x -> x.start_month..x.end_month)
    )  
dat3 = data.table(carrier     = c('AA', 'UA'),
                  start_month = c(1, 4),
                  end_month   = c(3, 6)) 


dat[dat3, on = .(carrier,
                 month >= start_month,
                 month <= end_month)] 

Rolling joins #

# using FlexiJoins, Distances
# A finance example 
# ...  measurement ...
df1 = DataFrame(
    name = "S",
    date = Date.(["2024-01-01", "2024-01-03", "2024-01-05", "2024-01-07", "2024-01-10"]),
    price = [100, 102, 104, 106, 108]
)
# ... event dates
df2 = DataFrame(
    name = "S",
    event = ["Earnings Call", "Product Launch", "Investor Meeting"],
    event_date = Date.(["2024-01-04", "2024-01-08", "2024-01-11"]),
    # date = Date.(["2024-01-04", "2024-01-08", "2024-01-11"])
)

# rolling to the next value (from below)
leftjoin(
    (df1, df2),
    by_key(:name) & 
        by_pred(:date, <=, :event_date);
    multi=(identity, closest)
)

# I could not find a distance defined on dates, so (inelegantly) I converted dates to integers
const DATEREF = Date("1900-01-01")
@rtransform!(df1, :date_days = Dates.value(:date) - Dates.value(DATEREF) )
@rtransform!(df2, :event_date_days = Dates.value(:event_date) - Dates.value(DATEREF) )
# rolling to nearest 
leftjoin(
    (df1, df2),
    by_key(:name) & by_distance(:date_days, :event_date_days, Euclidean(), <=(Inf)),
    multi=(identity, closest)
    )
# ...  measurement ...
dt1 <- data.table(
  date = as.Date(c("2024-01-01", "2024-01-03", "2024-01-05", "2024-01-07", "2024-01-10")),
  price = c(100, 102, 104, 106, 108)
)
# ... and  event dates
dt2 <- data.table(
  event = c("Earnings Call", "Product Launch", "Investor Meeting"),
  event_date = as.Date(c("2024-01-04", "2024-01-08", "2024-01-11"))
)

# rolling to next value (from below)
dt2[dt1, on = .(event_date = date), roll = -Inf]
# rolling to nearest
dt2[dt1, on = .(event_date = date), roll = "nearest"]

Appending data #

vcat(dat, dat)
vcat(dat, dat, cols=:union)
reduce(vcat, [dat, dat])
reduce(vcat, [dat, dat], cols=:union)
rbind(dat, dat)
rbind(dat, dat, fill = TRUE)
rbindlist(list(dat, dat)) # useful if working with list (purrr)
rbindlist(list(dat, dat), fill = TRUE)

