
Instrumental Variables: Two Stage Least Squares in Python

Last Update: March 24, 2022

Instrumental variables two stage least squares estimation in Python can be done using the linearmodels package IV2SLS class, found within the linearmodels.iv.model module, which estimates a linear regression when some independent variables are correlated with the error term (endogenous). The main parameters of IV2SLS are dependent for the model dependent variable, exog for the model exogenous independent variables, endog for the model endogenous independent variables and instruments for the model instrumental variables.
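As a minimal sketch of this call pattern (the synthetic data set and the names y, x_exog, x_endog and z below are invented purely for illustration and are not part of the house prices example), the estimator takes these four inputs and is then fitted:

import numpy as np
import pandas as pd
import linearmodels.iv.model as lm

# Synthetic data: x_endog is correlated with the error term u, z is an instrument.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
u = rng.normal(size=n)
x_exog = rng.normal(size=n)
x_endog = 0.8 * z + 0.5 * u + rng.normal(size=n)
y = 1.0 + 2.0 * x_endog + 0.5 * x_exog + u
df = pd.DataFrame({"y": y, "x_exog": x_exog, "x_endog": x_endog, "z": z})
df["const"] = 1.0

# dependent, exog, endog and instruments map to the four main IV2SLS parameters.
iv_sketch = lm.IV2SLS(dependent=df["y"],
                      exog=df[["const", "x_exog"]],
                      endog=df["x_endog"],
                      instruments=df["z"]).fit()
print(iv_sketch.params)  # the x_endog estimate should be close to the true value 2.0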

As an example, we can compare the estimated coefficients tables and F-statistics from the original multiple linear regression, in which house price is explained by its lot size and number of bedrooms, and from the two stage least squares multiple linear regression, in which house price is explained by the first stage fitted values of lot size and by the number of bedrooms, with whether the house has a driveway and the number of garage places as instrumental variables. The data are included within the AER R package HousePrices object [1].
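Spelled out (with coefficient names introduced here only for clarity), the models being compared are:

original model:  price = b0 + b1*lotsize + b2*bedrooms + u
first stage:     lotsize = a0 + a1*driveway + a2*garage + a3*bedrooms + v
second stage:    price = c0 + c1*fitted(lotsize) + c2*bedrooms + e

where fitted(lotsize) denotes the first stage fitted values, and driveway and garage instrument the endogenous lotsize variable.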

First, we import the statsmodels package, for data downloading and for fitting the original model with ordinary least squares, and the linearmodels package, for two stage least squares model fitting [2].

In [1]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import linearmodels.iv.model as lm

Second, we create the houseprices data object using the get_rdataset function and display the first five rows of its first three columns together with its sixth and eleventh columns (driveway and garage), using the print function and the head data frame method, to view its structure.

In [2]:
houseprices = sm.datasets.get_rdataset(dataname="HousePrices", package="AER", cache=True).data
print(houseprices.iloc[:, list(range(3)) + [5] + [10]].head())
Out [2]:
     price  lotsize  bedrooms driveway  garage
0  42000.0     5850         3      yes       1
1  38500.0     4000         2      yes       0
2  49500.0     3060         3      yes       0
3  60500.0     6650         3      yes       0
4  61000.0     6360         2      yes       0

Third, we fit the original model with the ols function using variables within the houseprices data object and store the outcome within the mlr1 object. Within the ols function, the parameter formula="price ~ lotsize + bedrooms" fits a model where house price is explained by its lot size and number of bedrooms.

In [3]:
mlr1 = smf.ols(formula="price ~ lotsize + bedrooms", data=houseprices).fit()

Fourth, we create the mdatac model data object by adding a constant column with the add_constant function. Within the add_constant function, the parameter data=houseprices includes the houseprices data object and prepend=False adds the constant as the last column of the mdatac data object. Then, we fit the two stage least squares model with the IV2SLS class using variables within the mdatac data object and store the outcome within the mlr2 object. Within IV2SLS, dependent=mdatac["price"] includes the model house price dependent variable, exog=mdatac[["const", "bedrooms"]] includes the constant and the model number of bedrooms exogenous independent variable, endog=mdatac["lotsize"] includes the model lot size endogenous independent variable and instruments=mdatac[["driveway", "garage"]] includes whether the house has a driveway and the number of garage places as the model instrumental variables. Within the fit method, cov_type="homoskedastic" selects homoskedastic variance-covariance matrix estimation and debiased=True adjusts the variance-covariance matrix estimation for degrees of freedom.

Notice that the fit parameters cov_type="homoskedastic" and debiased=True were only included as educational examples and can be modified according to your needs. Also, notice that estimating the two stage least squares model stage by stage with the ols function, instead of in simultaneous stages, would estimate correct coefficients but incorrect standard errors and F-statistic, as shown in the sketch after the next code cell. Additionally, notice that the two stage least squares mlr2 model estimation assumes that errors are homoskedastic.

In [4]:
mdatac = sm.add_constant(data=houseprices, prepend=False)
mlr2 = lm.IV2SLS(dependent=mdatac["price"],
                 exog=mdatac[["const", "bedrooms"]],
                 endog=mdatac["lotsize"],
                 instruments=mdatac[["driveway", "garage"]]).fit(cov_type="homoskedastic", debiased=True)
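As a brief sketch of the stage-by-stage point made above (reusing the houseprices data object and the smf import from earlier cells; the hp, fs, ss and lotsize_hat names are introduced here only for this illustration), the first stage regresses lotsize on the instruments and the exogenous regressor, and the second stage regresses price on the first stage fitted values and bedrooms. The coefficients should match the simultaneous mlr2 estimates, while the standard errors and F-statistic should not:

# First stage: endogenous lotsize on the instruments and the exogenous regressor.
hp = houseprices.copy()
fs = smf.ols(formula="lotsize ~ driveway + garage + bedrooms", data=hp).fit()
hp["lotsize_hat"] = fs.fittedvalues

# Second stage: price on the first stage fitted values and the exogenous regressor.
ss = smf.ols(formula="price ~ lotsize_hat + bedrooms", data=hp).fit()

# Coefficients match mlr2; standard errors (and hence t-statistics) do not.
print(ss.params)
print(ss.bse)

If heteroskedasticity is a concern, the fit call in the previous cell could instead use, for example, cov_type="robust", which linearmodels also supports.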

Fifth, we can print the mlr1 model estimated coefficients table by calling its summary method and selecting the second element of its tables attribute.

In [5]:
print(mlr1.summary().tables[1])
Out [5]:
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   5612.5997   4102.819      1.368      0.172   -2446.741    1.37e+04
lotsize        6.0530      0.424     14.265      0.000       5.219       6.887
bedrooms    1.057e+04   1247.676      8.470      0.000    8116.488     1.3e+04
==============================================================================

Sixth, we can print the mlr2 model estimated coefficients table using its summary property and selecting the second element of its tables attribute.

In [6]:
print(mlr2.summary.tables[1])
Out [6]:
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const      -1.913e+04     6540.7    -2.9248     0.0036  -3.198e+04     -6282.0
bedrooms       7680.1     1574.1     4.8791     0.0000      4588.1   1.077e+04
lotsize        12.519     1.2400     10.096     0.0000      10.084      14.955
==============================================================================

Seventh, we can print the mlr1 model estimated F-statistic and its associated p-value using its fvalue and f_pvalue properties.

In [7]:
print(mlr1.fvalue, mlr1.f_pvalue)
Out [7]:
159.636704936077 2.954866733815849e-55

Eighth, we can print the mlr2 model estimated F-statistic and its associated p-value using its f_statistic attribute.

In [8]:
print(mlr2.f_statistic)
Out [8]:
Model F-statistic
H0: All parameters ex. constant are zero
Statistic: 91.5197
P-value: 0.0000
Distributed: F(2,543)

Courses

My online courses are hosted on the Teachable website.

For more details on this concept, you can view my Linear Regression in Python Course.

References

[1] Data Description: Sales prices of houses sold in the city of Windsor, Canada, during July, August and September, 1987.

Original Source: Anglin, P., and Gencay, R. (1996). Semiparametric Estimation of a Hedonic Price Function. Journal of Applied Econometrics, 11, 633–648.

[2] statsmodels Python package: Seabold, Skipper, and Josef Perktold. (2010). “statsmodels: Econometric and statistical modeling with python”. Proceedings of the 9th Python in Science Conference.

linearmodels Python package: Kevin Sheppard. (2021). “Linear (regression) models for Python. Extends statsmodels with Panel regression, instrumental variable estimators, system estimators and models for estimating asset prices”. Python package version 4.25.
