From R data.table to Julia DataFrames.jl
January 7, 2024
Introduction #
This is meant as a resource for people trying to move between R and Julia who are using data.table on the one hand and DataFrames.jl on the other. I have gathered functions I tend to use frequently or that I find frustrating to look for when I need them. There are probably glaring omissions. Feel free to open an issue here.
Resources #
- I will mostly follow the excellent stata2r as a template.
- You can also have a look at some of the mappings in the DataFrames.jl documentation.
- I have found this guide to data.table by Andrew Brooks very useful.
A word of warning #
- Missing data. Dealing with missing data in Julia requires you to be explicit, as most functions will return a missing value if an array includes a missing element. Some functions will error. I will try to point out the equivalence with the R code, but be aware that the equivalence might not always be strictly exact if you implement it. I show some of the examples on the Freedman dataset from the car package in R.
- Needs improvement. A few specific sections could use some user input as the Julia solutions were less than ideal: leads and lags; dates; complex merges.
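The missing-value propagation mentioned above is easy to see at the REPL (a minimal sketch, independent of any dataset):

```julia
x = [1, 2, missing, 4]

# Most reductions propagate missing: one missing element poisons the result
sum(x)                  # missing

# skipmissing wraps the array so reductions ignore missing elements
sum(skipmissing(x))     # 7

# coalesce replaces missing with a fallback value, elementwise here
coalesce.(x, 0)         # [1, 2, 0, 4]
```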
Issues #
- This is a first draft and there might be a few mistakes throughout the document. I am not a professional programmer and I have probably picked up bad habits here and there.
- Email me at [email protected]
- File an issue on gitlab here if you want to start a discussion or have ideas of things to add/change
- Last updated on 2024-09-09
Installation #
You can install Julia here.
We will mostly use the base DataFrames package and the convenient macros in the DataFramesMeta package. To read in data we will use the CSV package, though Julia supports many other file formats (see FileIO.jl).
Installing data.table can be tricky, especially if you want to benefit from the multicore features. I recommend that you look at the installation wiki for more details.
import Pkg
Pkg.add("DataFrames")
Pkg.add("DataFramesMeta")
# Load the packages
using DataFrames
using DataFramesMeta
install.packages("data.table")
# latest development version that has passed all tests:
data.table::update_dev_pkg()
# Load the package
library(data.table)
We will also use other packages to load auxiliary datasets, download data, etc. I use the @pipe macro in Julia and the magrittr package in R to compose commands (though both packages have some amount of chaining built-in)
# Install the packages
Pkg.add("CSV")
Pkg.add("ShiftedArrays") # lead/lag operators
Pkg.add("HTTP") # download utilities
Pkg.add("RDatasets")
Pkg.add("Pipe") # pipes
Pkg.add("FlexiJoins") # for complex merges (by conditions)
Pkg.add("IntervalSets") # easier to formulate complex merge conditions
# Install the packages
install.packages("car") # RDataset
install.packages("magrittr") # pipes
install.packages("statar") # stata-style data utilities
install.packages("lubridate") # date utilities
# Load the packages
using HTTP
using CSV
using Dates
using FlexiJoins
using IntervalSets
using RDatasets
using Pipe
using ShiftedArrays
# We will need some statistical functions from the Statistics standard library
import Statistics: mean, median, quantile
# Load the packages
library(car)
library(magrittr)
library(statar)
library(lubridate)
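As a quick illustration of the chaining style used throughout, the @pipe macro inserts the result of the previous step wherever `_` appears (a toy example, unrelated to the flights data):

```julia
using Pipe

# `_` stands for the value flowing through the pipe at each step
@pipe [3, 1, 2] |> sort(_) |> _ .* 10 |> sum(_)   # 60
```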
File I/O #
Reading Data #
I am using the flights dataset to run most of the examples.
This assumes your data is in a csv format.
There are options in both languages to read from more efficient formats like parquet, but I have found that csv is both fast and idiot-proof, and it lends itself to quick manipulation in the shell if I need to do something quickly.
The CSV.jl package does not allow reading directly from a url, but we can use the HTTP.jl package to download the file. In R you can read directly from a url using fread in data.table.
# Flights data
url_flights = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
url_get = HTTP.get(url_flights);
dat = CSV.read(url_get.body, DataFrame)
# If you are reading a local compressed file, it gets expanded to /tmp by default
# upon reading (see this [issue](https://github.com/JuliaData/CSV.jl/issues/988)).
# If you are limited in how much you can write to `/tmp` (e.g. on an HPC cluster),
# this can prevent you from reading. Instead use the `buffer_in_memory=true` option:
dat = CSV.read("file.csv.gz", DataFrame, buffer_in_memory=true)
# Freedman data
dat_missing = RDatasets.dataset("car", "Freedman")
# Flights data
url_flights = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
dat = fread(url_flights)
# In fread you can choose to which temp folder to write intermediary files using `tmpdir`
# Note: this is different from the option offered in `julia`
fread("file.csv.gz", tmpdir="~/tmp")
# Freedman data
dat_missing = data.table(Freedman) # load and convert to data.table
Writing Data #
Both CSV.jl and data.table provide a write function with optional compression.
CSV.write("./data.csv", dat)
CSV.write("./data.csv.gz", dat; compress=true)
If you want to use a different compression algorithm (faster or more efficient), use TranscodingStreams.jl with the appropriate codec. For example, if you want to use zstd you would do:
Pkg.add("CodecZstd")
using CodecZstd
open(ZstdCompressorStream, "./data.csv.zst", "w") do stream
CSV.write(stream, dat)
end
# similarly to read it back
dat = open(ZstdDecompressorStream, "./data.csv.zst", "r") do stream
CSV.read(stream, DataFrame)
end
fwrite(dat, "./data.csv")
fwrite(dat, "./data.csv.gz", compress="gzip") # with compression
(Benchmarks) #
I would like to include a benchmark for one large file and compare csv intake to parquet.
Inspecting the dataset #
This is before we start filtering, collapsing, or merging the data.
Sorting #
It is important to be able to sort rows. What are the most delayed flights, what about the most delayed flight on a specific day?
Note that the code changes the order “in-place” as it changes the dataset itself.
sort!(dat, :air_time)
sort!(dat, [:air_time, :dest])
sort!(dat, :air_time; rev=true)
sort!(dat, [order(:air_time, rev=true), :dest])
# if you do not want to change the dataset in place
sort(dat, :air_time)
setorder(dat, air_time)
setorder(dat, air_time, dest)
setorder(dat, -air_time)
setorder(dat, -air_time, dest)
# if you do not want to change the dataset in place
dat[ order(air_time)] # etc.
# To reorder a dataset programmatically use the `setorderv` function
col = "air_time"
setorderv(dat, col)
If we want to reorder columns (rather than rows)
select!(dat, [:month, :day], Not([:month, :day]))
setcolorder(dat, c("month", "day"))
Renaming #
Renaming does modify the dataset but it does not alter its data so we include it here.
We use the @rename! macro from DataFramesMeta.jl on the Julia side and setnames from data.table on the R side.
@rename!(dat, :new_arr_delay = :arr_delay)
@rename!(dat, :new_carrier = :carrier, :new_origin = $"origin")
@rename(dat, :new_arr_delay = :arr_delay) # not in place
rename!(x -> replace(x, "arr_" => "arrival_"), dat) # use the base DataFrames.jl function here
setnames(dat, "new_arr_delay", "arrival_delay")
setnames(dat, c("carrier","origin"), c("new_carrier","new_origin"))
setnames(dat, gsub("arr_", "arrival_", names(dat)))
Summary Statistics #
describe(dat)
describe(dat, :arr_delay)
describe(dat, :min, :detailed, cols=:arr_delay)
sum_up(dat) # from statar package
sum_up(dat, arr_delay)
sum_up(dat, arr_delay, d = TRUE)
Tabulations #
summary_var = :carrier # or [:carrier, :origin]
@pipe dat |> groupby(_, summary_var) |>
combine(_, nrow => :Freq, proprow => :Percent) |>
@transform(_, :Cum = cumsum(:Percent))
tab(dat, carrier) # from statar package
tab(dat, carrier, origin)
# data.table version
dat[, .N, by = .(carrier, origin)][,
`:=`(Percent=100*N/nrow(dat), Cum=100*cumsum(N)/nrow(dat)) ][]
Filtering #
Subsetting rows #
On the flights data #
In data.table, filtering is not in place and you will need to assign the dataset (to itself) for the changes to propagate; in DataFramesMeta.jl, use the `!` versions of the macros (e.g. @rsubset!) to filter in place.
dat[1:200, :]
subset(dat, :day => x -> (x .> 5) .& (x .< 10)) # from dataframes base; note the parentheses around each comparison
@subset(dat, (:day .> 5) .& (:day .< 10)) # note the `.` for broadcasting
@rsubset(dat, :day > 5 && :day < 10) # note the `r` for change by row
@rsubset!(dat, :day > 5 && :day < 10) # change in place
@rsubset(dat, :origin == "LGA")
@rsubset(dat, occursin(r"^LG", :origin))
@rsubset(dat, :month ∈ [3, 4, 11, 12])
@rsubset(dat, :origin ∈ ["JFK", "LGA"])
@rsubset(dat, :month != 1)
dat[1:200]
dat[day > 5 & day < 10] # filtering only
dat = dat[day > 5 & day < 10] # filtering and reassigning
dat[origin=="LGA"]
dat[origin %like% "^LG"]
dat[month %in% c(3,4,11,12)]
dat[origin %chin% c("JFK","LGA")] # %chin% is a fast %in% for characters
dat[month!=1]
On the Freedman data (with missing values) #
In this case we need to be careful on how we deal with missing data.
Note that to find the missings Julia uses the isequal function (because missing == missing returns missing).
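The difference between the two comparison functions, sketched at the REPL:

```julia
# `==` follows three-valued logic: any comparison with missing is missing
missing == missing          # missing
1 == missing                # missing

# `isequal` always returns a Bool, so it is safe inside filters
isequal(missing, missing)   # true
isequal(1, missing)         # false

# `ismissing` is the idiomatic test for a single value
ismissing(missing)          # true
```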
subset(dat_missing, :Population => x -> x .< 1000; skipmissing=true)
@rsubset(dat_missing, :Population < 1000) # treats missing values as false by default
@rsubset(dat_missing, isequal(:Population, missing) )
dat_missing[ population < 1000 ]
dat_missing[is.na(population)]
Dropping duplicate or missing values #
Drop duplicate values #
unique(dat; keep=:first) # default
unique!(dat; keep=:first) # in place
unique(dat; keep=:noduplicates) # :last also an option
unique(dat, [:month, :day, :carrier])
unique(dat)
dat = unique(dat)
unique(dat, by = c("month", "day", "carrier"))
Drop missing values #
dropmissing(dat_missing)
dropmissing!(dat_missing) # in place
dropmissing(dat_missing, :Population)
dropmissing(dat_missing, [:Population, :Density])
# if the column types still allow missing values, convert the arrays to non-missing types
disallowmissing(dat_missing)
disallowmissing(dat_missing, :Density)
na.omit(dat_missing)
dat_missing = na.omit(dat_missing)
dat_missing[!is.na(population)]
dat_missing[!is.na(population) & !is.na(density)]
Selecting columns #
# select columns
select(dat, :month, :day, :carrier)
select!(dat, :month, :day, :carrier) # in place
select(dat, "month", "day", "carrier") # also works
select(dat, r"_delay")
select(dat, .!(eltype.(eachcol(dat)) .<: AbstractString) )
select(dat_missing, (eltype.(eachcol(dat_missing)) .<: Union{Missing, Int}) ) # if some columns include missing
# removing select columns
select(dat, Not([:origin, :dest]))
select!(dat, Not([:origin, :dest])) # in place
# select columns
dat[, .(month, day, carrier)]
dat = dat[, .(month, day, carrier)] # "in place"
dat[, c("month", "day", "carrier")] # same but programmatic
dat[, .SD, .SDcols=patterns("_delay")] # keep columns by matching
dat[, .SD, .SDcols=!is.character] # keep columns by type
# removing select columns
dat[, -c("origin", "dest")]
dat[, c("origin", "dest") := NULL] # same, but in-place
Rows and columns #
@select(@rsubset(dat, :origin=="LGA"),
:month, :day, :carrier)
@pipe dat |> @rsubset(_, :origin=="LGA") |>
@select(_, :month, :day, :carrier)
dat[origin=="LGA", .(month, day, carrier)]
Creating/modifying variables #
Basic operations #
@transform!(dat, :tot_delay = :dep_delay + :arr_delay)
@rtransform!(dat, :segment = :origin * :dest) # rowwise operation
# Programmatic version
x = "dep_delay"; y = "arr_delay"; z = "tot_delay";
@transform!(dat, $z = $x + $y)
# Conditional modification
dat[dat.month.==9, :distance] = dat[dat.month.==9, :distance] .+ 1;
dat[1:2, :air_time] .= 0;
# Using views (see https://bkamins.github.io/julialang/2024/05/10/jda.html)
@chain dat begin
@rsubset(:month == 1; view=true)
@rtransform!(:distance = :distance - 1)
parent # to recover the source dat
end
# Or with missing values
dat_missing[isequal.(dat_missing.Population, missing), :City] .= "NOPOP";
# Using ternary operator
@rtransform!(dat_missing, :City = ismissing(:Population) ? "NOPOP" : :City)
dat[, tot_delay := dep_delay + arr_delay]
dat[, segment := paste0(origin, dest) ]
# Programmatic version
x = "dep_delay"; y = "arr_delay"; z = "tot_delay"
dat[, c(z) := get(x) + get(y) ]
# Conditional modification
dat[month==9, distance := distance + 1]
dat[1:2, origin := "OBS"]
# Or with missing values
dat_missing[is.na(population), City := "NOPOP"]
Grouped operations #
@pipe dat |> groupby(_, :carrier) |>
@transform!(_, :avg_arr_delay = mean(:arr_delay)) # in place
@pipe dat |> groupby(_, :carrier) |>
@combine(_, :avg_arr_delay = mean(:arr_delay))
dat[, avg_arr_delay := mean(arr_delay), by=carrier]
dat[, .(avg_arr_delay = mean(arr_delay, na.rm=T)), by=carrier] # collapse see aggregation section below
Leads and lags #
Standard leads and lags #
I work with shifts on dates here, but the dataset is a full panel with consecutive dates so there is nothing special about the dates variable per se.
It is easier to see this on a smaller dataset, so I aggregate the data to get the total monthly flights out of each origin airport (see last section).
In Julia I use the ShiftedArrays package, which works fine for standard operations (read: as long as you are not dealing with dates). In R I use the shift function from data.table.
dat_shift = combine(
groupby(dat, [:origin, :month]),
nrow => :N)
sort!(dat_shift, [:origin, :month])
@transform!(groupby(dat_shift, :origin),
:growth = :N ./ ShiftedArrays.lag(:N, 1))
@transform!(groupby(dat_shift, :origin),
:growth_since_first = :N ./ :N[1] )
# The following is probably not optimal; here are two versions
@pipe dat_shift |>
groupby(_, :origin) |> @subset(_, 5 ∈ :month) |> # this errors if month 5 is missing in a group
groupby(_, :origin) |> @transform(_, :growth_since_may = :N ./ :N[:month.==5])
for subdf in groupby(dat_shift, :origin)
if 5 ∈ subdf.month
@transform!(subdf, :growth_since_may = :N ./ :N[:month.==5])
end
end
dat_shift = dat[, .N, by = .(origin, month)]
setorder(dat_shift, origin, month)
dat_shift[, growth := N/shift(N, 1), by = origin]
dat_shift[, growth_since_first := N/N[1], by = origin]
dat_shift[, growth_since_may := N/N[month==5], by = origin]
dat_shift[, growth_since_may := .SD[["N"]]/.SD[month==5][["N"]],
.SDcols = c("N", "month"), by = origin]
The case of dates #
Dates are messy (imho). Lagging a variable by three months in a monthly panel does not necessarily translate into shifting the data by 3 indices (if the panel is unbalanced, for example). The correct date function should check that "shifting" back by three months from April corresponds to January, and not land on December just because January is missing.
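Julia's Dates standard library implements this calendar arithmetic (as opposed to index shifting); a small sketch:

```julia
using Dates

# Subtracting Month(3) from an April date lands in January of the same year,
# regardless of how many rows sit between the two dates in the panel
Date(2014, 4, 15) - Month(3)   # 2014-01-15

# End-of-month dates are clamped to the last valid day rather than overflowing
Date(2014, 3, 31) - Month(1)   # 2014-02-28
```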
@rtransform!(dat, :date = Date(:year, :month, :day) )
# Including time
@rtransform!(dat, :date_time = DateTime(:year, :month, :day, :hour) )
@rtransform!(dat, :date_y = year(:date))
@rtransform!(dat, :f7d_date = :date + Dates.Day(7))
@rtransform!(dat, :l3m_date = :date - Dates.Month(3))
# Make a date variable using data.table built-in IDate
dat[, date := as.IDate(paste(year, month, day, sep='-'))]
# It is usually faster to use lubridate parser
dat[, date := parse_date_time2(paste(year, month, day, sep='-'), "Y-m-d")]
dat[, date_time := parse_date_time2(paste(year, month, day, hour, sep='-'), "Y-m-d-H")]
dat[, date_y := year(date)] # extract year
dat[, f7d_date := date + days(7) ] # date in 7 days
dat[, l3m_date := date - months(3) ] # date three months ago
Once we know how to lag dates, we would like to answer questions such as: what was the average flight delay for each origin airport three months ago compared to today? We will work with the aggregate delays by origins.
In Julia, the ShiftedArrays package does not support dates (see this post on discourse and this issue). This is one of the most annoying things when working with dates and panel data in Julia. I have found that PanelShift.jl solves the problem, but it is still in version 0.1.1 and it is unclear how many updates it is receiving. What is nice with Julia is that you can simply loop over the data and do exactly what you want to do.
In R I use tlag and tlead from statar, which lag based on date intervals.
@rtransform!(dat, :date = Date(:year, :month, :day) )
dat_shift = @combine(groupby(dat, [:origin, :date]), :arr_delay = mean(:arr_delay) )
# I could not find a built-in function, but julia is amenable to loops
dat_shift.l3m_arr_delay = missings(Float64, nrow(dat_shift)); # initialize the column with missing
for subdf in groupby(dat_shift, :origin)
for date_iter in subdf.date
idx = isequal.(subdf.date, date_iter - Dates.Month(3))
if (sum(idx)==1)
subdf[ subdf.date .== date_iter, :l3m_arr_delay] .= subdf[idx, :arr_delay]
end
end
end
# sort(dat_shift, [:origin, :date])
# using PanelShift
@transform!(groupby(dat_shift, :origin),
:l3m_arr_delay = tlag(:date, :arr_delay, Month(3) ) )
panellag!(dat_shift, :origin, :date,
:arr_delay, :l3m_arr_delay, Month(3))
dat_shift = dat[, .(arr_delay = mean(arr_delay, na.rm=T)),
by = .(origin, date=parse_date_time2(paste(year, month, day, sep="-"), "Y-m-d"))]
dat_shift[, l3m_arr_delay := tlag(arr_delay, n=months(3), time=date), by = .(origin) ]
# setorder(dat_shift, origin, date)
Advanced examples #
Applying functions to multiple variables #
- Loops: if you have to use loops for convenience, data.table provides set, which allows you to change values without the overhead of data.table. The syntax is of the form set(dat, i, j, value), where i is the row index and j the column index (or name).
# Loops
for col in (:tot_delay, :dep_delay)
dat[1:10, col] = - dat[1:10, col]
end
# Control flows
@rtransform!(dat, :arr_delay_penalty = ifelse(:arr_delay>12, 1, -1) )
@rtransform!(dat, :arr_delay_penalty = ifelse( # no case function in base julia
    :arr_delay>=15, 3,
    ifelse(:arr_delay>=6, 2,
        ifelse(:arr_delay>=0, 1, 0) ) ) )
# Modify multiple variables at the same time (same function)
cols = [:origin, :dest]
transform!(dat, cols .=> (x -> x .* "Airport") .=> cols)
# Apply multiple functions to one variable
list_fun = [mean, median, x->quantile(x, 0.25)]
transform(dat, :dep_delay .=> list_fun .=> [:delay_mean, :delay_median, :delay_q25])
# Apply multiple functions to multiple variables (automatic renaming)
cols = [:dep_delay :arr_delay]
res_cols = kron(["mean_", "median_", "q25_"], string.(cols))
transform(dat, cols .=> list_fun .=> res_cols)
# Function of multiple variables (by rows)
ratio_airtime(x::NamedTuple) = (x[1] /(x[1]+x[2]))
@rtransform! dat :frac_air = ratio_airtime(AsTable([:air_time, :tot_delay]))
ratio_airtime(x, y) = (x /(x+y))
transform(dat, [:air_time, :tot_delay] => ByRow((ratio_airtime)) => :frac_air)
# Loops
for (j in c("tot_delay", "dep_delay")){ # (faster) loops
set(dat, 1:10, j=j, value=-dat[1:10][[j]])
}
# Control flows
dat[, arr_delay_penalty := fifelse(arr_delay>12, 1, -1) ]
dat[, arr_delay_penalty := fcase(arr_delay>=15, 3,
    arr_delay>=6, 2,
    arr_delay>=0, 1,
    default = 0) ]
# Modify multiple variables at the same time
cols = c("origin", "dest")
dat[, (cols) := lapply(.SD, \(x) paste(x,"Airport")),
.SDcols = cols]
# Apply multiple functions to one variable
list_fun = function(x) list(mean(x), median(x), quantile(x, 0.25))
dat[, c("delay_mean", "delay_median", "delay_q25") := sapply(.SD, list_fun),
.SDcols = c("dep_delay")]
# Apply multiple functions to multiple variables
dat[, as.list(unlist(lapply(.SD, list_fun))), .SDcols = c("dep_delay", "arr_delay") ]
# other method
melt(dat, measure.vars=c("dep_delay", "arr_delay") )[
, sapply(.SD, list_fun), .SDcols = c("value"), by = .(variable)]
# Function of multiple variables
dat[, tot_delay := rowSums(.SD), .SDcols=patterns("_delay")]
ratio_airtime = function(air, tot) (air /(air+tot))
# dat[, frac_air := ratio_airtime(air_time, tot_delay) ]
dat[, frac_air := ratio_airtime(air_time, tot_delay), by=.I] # row-wise operation
- See caveat here
Applying complex functions #
# Regressions by groups
# using FixedEffectModels
for subdf in groupby(dat, [:year, :month])
reg_res = reg(subdf, @formula(tot_delay ~ air_time))
subdf[:, :β] .= coef(reg_res)[2]
subdf[: , :σ] .= sqrt.(vcov(reg_res)[2,2])
end
select(dat, [:year, :month, :β, :σ]) |> unique
# Regressions by groups
dat_reg = dat[, .(
{
y = as.matrix(.SD[["tot_delay"]])
x = as.matrix(cbind(1, .SD[["air_time"]]) )
reg_res = lm.fit(x, y)
b = coefficients(reg_res)[2]
se = sqrt(sum(reg_res[["residuals"]]^2) / var(.SD[["air_time"]]) ) / .N
c(b,se)
},
seq(1,2) ),
by = .(year, month) ]
dcast(dat_reg, year + month ~ V2, value.var="V1") # anticipating on reshape section
Aggregating #
@combine(dat, :mean_dep_delay = mean(:dep_delay))
@combine(dat, :mean_dep_delay = mean(:dep_delay), :mean_arr_delay = mean(:arr_delay))
combine(dat, [:dep_delay, :arr_delay] .=> mean .=> [:mean_dep_delay, :mean_arr_delay])
@combine(dat, $AsTable = mean.([:dep_delay, :arr_delay]))
@combine(dat, $([:dep_delay, :arr_delay] .=> mean) )
# More complex but useful for quantiles
@combine(dat, $AsTable = NamedTuple( (^(:q25_dep_delay), ^(:q75_dep_delay)) .=>
(quantile(:dep_delay, [0.25, 0.75])) ) )
dat[, mean(dep_delay)] # returns a scalar
dat[, .(mean_ddel = mean(dep_delay))] # returns a data.table
dat[, .(mean_ddel=mean(dep_delay), mean_adel=mean(arr_delay))]
dat[, lapply(.SD, mean), .SDcols=c('arr_delay','dep_delay')]
# More than one variable
dat[, as.list(quantile(.SD, c(.25, .75), na.rm = TRUE)), .SDcols="dep_delay" ]
# and merge back ...
Reshaping #
Wide to long #
In Julia use stack for going from wide to long; in data.table use melt.
# It is easier if we give all the flights a unique identifier
@transform!(dat, :uid = 1:nrow(dat))
stack(dat, [:dep_delay, :arr_delay] )
stack(dat, r"_delay")
dat_long = stack(dat,
[:dep_delay, :arr_delay],
[:uid, :carrier, :origin, :dest])
# It is easier if we give all the flights a unique identifier
dat[, uid := seq(1, .N) ]
melt(dat, measure=c("dep_delay", "arr_delay"))
melt(dat, measure=patterns('_delay'))
dat_long = melt(dat,
measure=c("dep_delay", "arr_delay"),
id.vars=c("uid", "carrier", "origin", "dest"))
Long to wide #
In Julia use unstack for going from long to wide; in data.table use dcast.
# We start with the long data from above
dat_wide = unstack(dat_long)
# If you only want to keep the id & *_delay cols
unstack(dat_long, :uid, :variable, :value)
# We start with the long data from above
dat_wide = dcast(dat_long, ... ~ variable)
# If you only want to keep the id & *_delay cols
dcast(dat_long, uid ~ variable)
Merging #
Basic merge #
# Load second dataset
dat_airports = CSV.read(
HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/nycflights13/airports.csv").body,
DataFrame)
# Inner join
innerjoin(dat, dat_airports, on=:dest => :faa)
# if the datasets share a common name for the merge variable
@rename!(dat_airports, :dest=:faa)
innerjoin(dat, dat_airports, on=:dest)
# leftjoin! updates the first dataframe argument in place (the only in-place join in DataFrames.jl)
leftjoin!(dat, dat_airports, on=:dest)
# with missing values
innerjoin(dat, dat_airports, on=:dest; matchmissing=:error)
# Other types of merge
# Left join
leftjoin(dat, dat_airports, on=:dest)
leftjoin(dat, dat_airports, on=:dest, source="_merge") # stata style merge info
# if there are missing values in the merge columns
leftjoin(dat, dat_airports, on=:dest, matchmissing=:notequal) #
# Right join
rightjoin(dat, dat_airports, on=:dest)
# Outer join
outerjoin(dat, dat_airports, on=:dest)
# Semi join (filtering)
semijoin(dat, dat_airports, on=:dest)
# Anti join
antijoin(dat, dat_airports, on=:dest)
# Cross join
crossjoin(unique(select(dat, :origin)), unique(select(dat, :dest)) )
# Load second dataset
dat_airports = fread(
"https://vincentarelbundock.github.io/Rdatasets/csv/nycflights13/airports.csv")
# Inner join (default)
merge(dat, dat_airports, by.x = c("dest"), by.y = c("faa"))
# if the datasets share a common name for the merge variable
setnames(dat_airports, c("faa"), c("dest"))
merge(dat, dat_airports, by = c("dest"))
# with missing values
merge(dat, dat_airports, by = c("dest"))
# Other types of merge
# Left join
merge(dat, dat_airports, by = c("dest"), all.x = TRUE)
# Right join
merge(dat, dat_airports, by = c("dest"), all.y = TRUE)
# Outer join
merge(dat, dat_airports, by = c("dest"), all.x = TRUE, all.y = TRUE)
# Semi join (filtering)
merge(dat, dat_airports[, .(dest)], by = c("dest"))
# Anti join
dat[fsetdiff(dat[, .(dest)], dat_airports[, .(dest)]), on = "dest"]
# Cross join
CJ(unique(dat[["origin"]]), unique(dat[["dest"]])) # all combination of origin and destinations
Advanced merging #
Non-equi joins #
dat3 = DataFrame(carrier = ["AA", "UA"],
start_month = [1, 4], end_month = [3, 6])
# Conditional join that catches everything between the distinct
# start and end dates for each carrier.
innerjoin((dat, dat3),
by_key(:carrier) &
by_pred(:month, ∈, x->x.start_month..x.end_month)
)
dat3 = data.table(carrier = c('AA', 'UA'),
start_month = c(1, 4),
end_month = c(3, 6))
dat[dat3, on = .(carrier,
month >= start_month,
month <= end_month)]
Rolling joins #
# using FlexiJoins, Distances
# A finance example
# ... measurement ...
df1 = DataFrame(
name = "S",
date = Date.(["2024-01-01", "2024-01-03", "2024-01-05", "2024-01-07", "2024-01-10"]),
price = [100, 102, 104, 106, 108]
)
# ... event dates
df2 = DataFrame(
name = "S",
event = ["Earnings Call", "Product Launch", "Investor Meeting"],
event_date = Date.(["2024-01-04", "2024-01-08", "2024-01-11"]),
# date = Date.(["2024-01-04", "2024-01-08", "2024-01-11"])
)
# rolling to the next value (from below)
leftjoin(
(df1, df2),
by_key(:name) &
by_pred(:date, <=, :event_date);
multi=(identity, closest)
)
# I could not find a distance on dates so inelegantly I converted date to Int
const DATEREF = Date("1900-01-01")
@rtransform!(df1, :date_days = Dates.value(:date) - Dates.value(DATEREF) )
@rtransform!(df2, :event_date_days = Dates.value(:event_date) - Dates.value(DATEREF) )
# rolling to nearest
leftjoin(
(df1, df2),
by_key(:name) & by_distance(:date_days, :event_date_days, Euclidean(), <=(Inf)),
multi=(identity, closest)
)
# ... measurement ...
dt1 <- data.table(
date = as.Date(c("2024-01-01", "2024-01-03", "2024-01-05", "2024-01-07", "2024-01-10")),
price = c(100, 102, 104, 106, 108)
)
# ... and event dates
dt2 <- data.table(
event = c("Earnings Call", "Product Launch", "Investor Meeting"),
event_date = as.Date(c("2024-01-04", "2024-01-08", "2024-01-11"))
)
# rolling to next value (from below)
dt2[dt1, on = .(event_date = date), roll = -Inf]
# rolling to nearest
dt2[dt1, on = .(event_date = date), roll = "nearest"]
Appending data #
vcat(dat, dat)
vcat(dat, dat, cols=:union)
reduce(vcat, [dat, dat])
reduce(vcat, [dat, dat], cols=:union)
rbind(dat, dat)
rbind(dat, dat, fill = TRUE)
rbindlist(list(dat, dat)) # useful if working with list (purrr)
rbindlist(list(dat, dat), fill = TRUE)