Kleibergen-Paap F-statistics in stata and julia

Kleibergen-Paap F-statistics in stata and julia

October 28, 2021

This was a joint effort with Matthieu Gomez who created and maintain the high-dimensional fixed effect regression package for julia FixedEffectModels.jl. Valentin Haddad and myself uncovered this bug while working on our project on passive investing and market competition.

This post describes how the Kleibergen-Paap first-stage F-statistics can be misleading when using ivreg2 and ivreghdfe in stata.

In julia the FixedEffectModels package deals with this type of regressions and usually mimicks stata. We found the issue was due to a specific case in the Kleibergen-Paap original paper and were able to patch it here. We find that the issue appears with instruments interacted with fixed effects.

tl;dr; We found potential mistakes in stata’s ivreg2 and ivreghdfe for special cases. We found a way to fix it in julia’s FixedEffectModels; stata developers should look into patching it.

Framework #

We consider a model with two groups indexed by \(k \in {1,2}\). We are interested in the following regression: $$ \begin{align} Y_{i,k} &= \alpha_k + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \varepsilon_i \\ X_{i,k} &= \delta_k + \gamma_k Z_{i,k} + u_{i,k} \end{align} $$ The \(X\)s are endogenous regressors and the \(Z\)s are instruments. We are interested in a two-stage-least-squares regression of \(Y\) on \(X\) using the exogenous variation of the \(Z\).

Setting up the problem #

We set up the problem in julia (v1.6.0) and provide a minimal reproducible example. We start by setting up the packages and creating a small dataset (20 rows) with two identifiers.

First we import the packages (sorry no checkpointing here apart from for the core package).

If you want to run this in isolation you can create a new directory and run it from there with julia --project=. and then activate the directory using Pkg; Pkg.activate("."); Pkg.instantiate()

import Pkg;
Pkg.add(name="FixedEffectModels", version="1.4.2"); # this new version has the fix
Pkg.add(name="Vcov", version="0.4.2");              # this new version has the fix

# if you want to run the version without the fix (mimicks stata)
# Pkg.add(name="Vcov", version="1.4.0"); 
# Pkg.add(name="Vcov", version="0.4.0"); # if you want to run the version without the fix (mimicks stata)

Pkg.add(["Revise", "CategoricalArrays", "DataFrames", "Random", "CSV", "Suppressor"]);
Pkg.add(["Plots", "PGFPlotsX", "LaTeXStrings"]);
Pkg.add(["FixedEffectModels", "RegressionTables"]);
Pkg.add("RCall");
Pkg.add(url="https://github.com/jmboehm/StataCall.jl#master");

And then load them for our session

using Printf
using Revise, Suppressor
using CategoricalArrays, DataFrames, CSV, Random
using Plots, LaTeXStrings; pgfplotsx(size=(600, 400));
using FixedEffectModels
using StataCall, RegressionTables
using RCall

Generating the dataset #

The following function generates the dataset. We introduce a small \(\epsilon \) to add noise to the instrument. This will be useful later as we show that the problem fails locally but not once we perturb the data.

function gen_df(ϵ; seed::Int=107, N_rows=10)
  Random.seed!(seed)
  N_id   = 2;
  i = 1
  df_example = DataFrame()
  for i in 1:N_id
    obs = 1:N_rows
    id = i .* Int.(ones(N_rows))
    Z = rand(N_rows)
    X = Z + 0.5 .* rand(N_rows)
    Y = 3 .* Z + 0.3 .* rand(N_rows)
    df_tmp = DataFrame(obs=obs, id = id, Z = Z, X = X, Y = Y)
    df_example = vcat(df_example, df_tmp)
  end  

  df_example1 = unstack(select(df_example, :obs, :id, :X), [:obs, :id, :X], :id, :X, renamecols=x->Symbol(:X, Int(x)))
  df_example2 = unstack(select(df_example, :obs, :id, :Z), [:obs, :id, :Z], :id, :Z, renamecols=x->Symbol(:Z, Int(x)))

  df_example = innerjoin(df_example, select(df_example1, :obs, :id, :X1, :X2), on = [:obs, :id])
  df_example = innerjoin(df_example, select(df_example2, :obs, :id, :Z1, :Z2), on = [:obs, :id])

  df_example[ ismissing.(df_example.X1), :X1] .= 0.0;
  df_example[ ismissing.(df_example.X2), :X2] .= 0.0;
  df_example[ ismissing.(df_example.Z1), :Z1] .= 0.0;
  df_example[ ismissing.(df_example.Z2), :Z2] .= 0.0;

  transform!(df_example, :id => categorical => :id)
  transform!(df_example, :Z1 => (x-> x .+ ϵ .* rand(N_id*N_rows)) => :Z1eps)
  transform!(df_example, :Z2 => (x-> x .+ ϵ .* rand(N_id*N_rows)) => :Z2eps)

  return(df_example)

end;

And you can generate and view the data as follows:

$ df_example = gen_df(0.01, seed=107, N_rows=10);

$ show(df_example, display_size=(19, 200))
20×11 DataFrame
 Row │ obs    id    Z         X         Y         X1        X2        Z1        Z2        Z1eps       Z2eps
Int64  Cat…  Float64   Float64   Float64   Float64?  Float64?  Float64?  Float64?  Float64     Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
   11  1     0.119228  0.478444  0.587155  0.478444  0.0       0.119228  0.0       0.121428    3.31773e-5
   22  1     0.26591   0.568968  0.938078  0.568968  0.0       0.26591   0.0       0.272448    0.000209308
   33  1     0.745445  1.01965   2.25625   1.01965   0.0       0.745445  0.0       0.749129    0.00955548
   44  1     0.174331  0.208681  0.746424  0.208681  0.0       0.174331  0.0       0.178839    0.00739797
   55  1     0.370533  0.759199  1.20848   0.759199  0.0       0.370533  0.0       0.375502    0.00863599
                                                                                          
  166  2     0.674156  0.891977  2.10047   0.0       0.891977  0.0       0.674156  0.00662125  0.676856
  177  2     0.568036  0.787749  1.9021    0.0       0.787749  0.0       0.568036  0.00465503  0.57743
  188  2     0.935972  1.03212   3.03463   0.0       1.03212   0.0       0.935972  0.00264424  0.939824
  199  2     0.719989  0.927739  2.37524   0.0       0.927739  0.0       0.719989  0.00676318  0.729181
  2010  2     0.840836  0.973702  2.75698   0.0       0.973702  0.0       0.840836  0.00371209  0.849899
                                                                                                   10 rows omitted

Running the regression: julia #

We run two sets of regressions. These ones have the bug fix and should run correctly.

The first regression is the one described above. The second one uses the noisy instruments rather than the original instruments. The noisy instruments are defined as $$ Z_{i,k}^{\epsilon} = Z_{i,k} + \epsilon \cdot \text{Uniform} $$ What is interesting is how the Kleibergen-Paap statistics stay close as we perturb the data slightly.

r1 = reg(df_example, @formula(Y ~ fe(id) + (X1 + X2 ~ Z1 + Z2) ), Vcov.robust() );
r2 = reg(df_example, @formula(Y ~ fe(id) + (X1 + X2 ~ Z1eps + Z2eps) ), Vcov.robust() );

And to view the results

$ regtable(r1,r2; regression_statistics = [:f_kp])

---------------------------------------------
                                   Y
                          -------------------
                               (1)        (2)
---------------------------------------------
X1                        3.415***   3.395***
                           (0.771)    (0.763)
X2                        4.129***   4.135***
                           (0.474)    (0.475)
---------------------------------------------
id Fixed Effects               Yes        Yes
---------------------------------------------
First-stage F statistic     12.231     12.432
---------------------------------------------

To check the smoothness of the statistic and that the knife-edge case has actually been fixed, we look at how the F-stat varies for different values of the perturbation.

ϵ_vec = range(-0.025, 0.025, length=25)
F_vec = similar(ϵ_vec)
iter_ϵ = 1

for iter_ϵ in 1:length(ϵ_vec)
    df_example = gen_df(ϵ_vec[iter_ϵ], seed=107, N_rows=10)
    r_tmp = reg(df_example, @formula(Y ~ fe(id) + (X1 + X2 ~ Z1eps + Z2eps) ), Vcov.robust() )
    F_vec[iter_ϵ] = r_tmp.F_kp
end

plot(ϵ_vec, F_vec, 
     xlabel=L"$\epsilon$ Perturbation of Instrument Z",
     ylabel="Kleibergen-Paap first stage F-stat", legend=false,
     dpi=300);
plot!([0.0], seriestype="vline", color=:black)

Running the regression: Stata #

We use the StataCall package to run commands in stata. To set it up you need to make sure that julia has access to your path, see the repo for more details on how to do it for your system.

It might be easier to save the file and then load directly in stata.

df_example = gen_df(0.01, seed=107, N_rows=10);
CSV.write("./stata_data.csv", df_example);

If you go the StataCall route, you will simply pass the regression commands as strings:

dfOut = StataCall.stataCall(
    ["ivreg2 Y (X1 X2 = Z1 Z2) id, robust"; "gen F = e(rkf)";
     "ivreg2 Y (X1 X2 = Z1eps Z2eps) id, robust"; "gen Feps = e(rkf)";   ], 
    df_example, true, true, true); # replace the last argument by false to see the stata-log
Float64(dfOut.F[1]), Float64(dfOut.Feps[1]); # returns the F-stats 

We can confirm that the results are not smooth

$ msg =  "\nF-statistics estimated with ivreg2 and exact instruments: F = " * @sprintf("%.2f", dfOut.F[1]) *
       "\nF-statistics estimated with ivreg2 and ϵ-perturbed instruments: F = " * @sprintf("%.2f", dfOut.Feps[1]);
$ @info msg
┌ Info:
│ F-statistics estimated with ivreg2 and exact instruments: F = 0.00
└ F-statistics estimated with ivreg2 and ϵ-perturbed instruments: F = 12.43

We find that the result is 0 for the case with the actual instrument, which is not the expected result. We also checked whether the trick of perturbing the instruments might help and confirm the results above found in the julia code: the F-stat is 12.43 which corresponds to the desired estimate.

We try using ivreghdfe which suffers from similar issues. Note that ivreghdfe at least throws a warning (about collinearity) and does not return a number for the statistic. The peturbation still works well.

dfOut = StataCall.stataCall(
    ["ivreghdfe Y (X1 X2 = Z1 Z2), absorb(id) robust"; "gen F = e(rkf)";
     "ivreghdfe Y (X1 X2 = Z1eps Z2eps), absorb(id) robust"; "gen Feps = e(rkf)";   ], 
    df_example, true, true, true); # replace the last argument by false to see the stata-log
Fout = map(x-> ismissing(x) ? string(x) : @sprintf("%.2f", x), [dfOut.F[1], Float64(dfOut.Feps[1])] ) # returns the F-stats

Same as before:

$ msg =  "\nF-statistics estimated with ivreghdfe and exact instruments: F = " * Fout[1] *
       "\nF-statistics estimated with ivreghdfe and ϵ-perturbed instruments: F = " * Fout[2] 
$ @info msg
┌ Info:
│ F-statistics estimated with ivreghdfe and exact instruments: F = 0.00
└ F-statistics estimated with ivreghdfe and ϵ-perturbed instruments: F = 12.43

The fix to the covariance estimator only consisted of a few lines to handle a special case. You can see the commit here.

Ideally this would be implemented in the stata code as well.

Running the regression: R #

We use the RCall package to run commands in R within julia just as with StataCall (to run it make sure that your R environment is in the path, and of course install the relevant packages).

# Load all relevant libraries to R (make sure they are installed in your R)
@rput df_example
@suppress begin
R"""   
    library(data.table, verbose=T)
    library(lfe, verbose=T)
    library(fixest, verbose=T)
    library(texreg, verbose=T)
    dt_reg = data.table(df_example);
    dt_reg[, id := as.factor(id)];
"""
end
RObject{VecSxp}

It is hard to make sense of the results of felm here (note that I included id clustering, otherwise the estimation throws an error probably because of an zero inversion).

# @suppress begin
R"""
     est_felm      <- felm(Y ~ X1 + X2 | id | (X1 + X2 ~ Z1 + Z2) | id, dt_reg)
     est_felm_eps  <- felm(Y ~ X1 + X2 | id | (X1 + X2 ~ Z1eps + Z2eps) | id, dt_reg)
     # res_s = screenreg(list(est_felm, est_felm_eps), include.fstatistic = T)     
"""
end
@rget res_s
print(res_s)

The feols function detects collinearity in X1 and X2 which I do not understand, and stops there. Edit: this seems to gives us the correct results here.

@suppress begin
R"""
    est_feols     <- feols(Y ~ 1 | id | X1 + X2 ~ Z1 + Z2, dt_reg)
    est_feols_eps <- feols(Y ~ 1 | id | X1 + X2 ~ Z1eps + Z2eps, dt_reg)
"""
# or run tests
R"""
    fitstat(est_feols, "ivf")
    fitstat(est_feols_eps, "ivf")
    fitstat(est_feols, "ivwald")
    fitstat(est_feols_eps, "ivwald")
"""
end

R"""
    print(est_feols)
    print(est_feols_eps)
"""

Perhaps we are missing other relevant packages that would take care of this issue (AER or estimatr). However these two are the fastest and most prevalent when dealing with relatively high-dimensional data. At this point it is not clear how to deal with this type of regressions in R.



Code, Statistics
Julia, Statistics