# Instrumental Variables: Two Stage Least Squares in R

Last Update: March 24, 2022

Two stage least squares (2SLS) instrumental variables estimation can be done in R with the `ivreg` function from the `AER` package. It estimates a linear regression in which one or more independent variables are correlated with the error term (endogenous). The main parameters of `ivreg` are `formula` and `data`. In `formula = y ~ x1 + x2 | x2 + z1 + z2`, the part before `|` is the original model, with `x1` as endogenous independent variable and `x2` as exogenous independent variable. The part after `|` lists the first stage regressors: the exogenous independent variable `x2` and the instrumental variables `z1` and `z2`. Parameter `data` is a `data.frame` object containing the model variables.
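To see why an endogenous regressor is a problem, the following minimal sketch uses simulated data in base R. Everything in it (the variables `z`, `u`, `x`, `y`, the sample size and the true slope of 2) is a hypothetical assumption made for illustration and is not part of the `HousePrices` example. The two stages are run by hand with `lm` only to show the mechanics; with `AER` the equivalent call would be `ivreg(y ~ x | z)`.

```r
# Simulated illustration (hypothetical data, not taken from the article)
set.seed(123)
n <- 5000
z <- rnorm(n)            # instrument: drives x, unrelated to the error term
u <- rnorm(n)            # structural error term
x <- z + u + rnorm(n)    # endogenous regressor: correlated with u by construction
y <- 1 + 2 * x + u       # true slope on x is 2

# OLS ignores the correlation between x and u, so its slope is biased upward here
ols_slope <- unname(coef(lm(y ~ x))["x"])

# Manual 2SLS: first stage fitted values replace x in the second stage
x_hat <- fitted(lm(x ~ z))
iv_slope <- unname(coef(lm(y ~ x_hat))["x_hat"])

round(c(ols = ols_slope, iv = iv_slope), 3)
```

In this setup the OLS slope should settle noticeably above 2, while the instrumental variables slope should be close to 2.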

As an example, we compare estimated coefficients tables and F-statistics from two models fitted on the `HousePrices` data included in the `AER` package. The original model is a multiple linear regression of house price explained by lot size and number of bedrooms. The two stage least squares model explains house price by the first stage fitted values of lot size and by number of bedrooms, using whether the house has a driveway and the number of garage places as instrumental variables.

First, we load the `AER` package for the data and two stage least squares estimation.

``````In :
library(AER)
``````

Second, we load the `HousePrices` data object from the `AER` package with the `data` function. Then we use the `head` function to print the first six rows of columns one to three, six and eleven, to view the `data.frame` structure.

``````In :
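data(HousePrices)
head(HousePrices[, c(1:3, 6, 11)])  # first six rows; columns price, lotsize, bedrooms, driveway, garage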
``````
``````Out :
  price lotsize bedrooms driveway garage
1 42000    5850        3      yes      1
2 38500    4000        2      yes      0
3 49500    3060        3      yes      0
4 60500    6650        3      yes      0
5 61000    6360        2      yes      0
6 66000    4160        3      yes      0
``````

Third, we fit the original model with the `lm` function on the `HousePrices` data and store the result in the `mlr1` object. Parameter `formula = price ~ lotsize + bedrooms` specifies that house price is explained by lot size and number of bedrooms.

``````In :
mlr1 <- lm(formula = price ~ lotsize + bedrooms, data = HousePrices)
``````

Fourth, we fit the two stage least squares model with the `ivreg` function on the `HousePrices` data and store the result in the `mlr2` object. In `formula = price ~ lotsize + bedrooms | bedrooms + driveway + garage`, the part before `|` is the original model, where house price is explained by lot size (endogenous independent variable) and number of bedrooms (exogenous independent variable). The part after `|` lists the first stage regressors: number of bedrooms (exogenous independent variable) plus whether the house has a driveway and the number of garage places (instrumental variables). Notice that estimating the two stages one at a time with the `lm` function, instead of jointly with `ivreg`, would give correct coefficients but incorrect standard errors and F-statistic, because the second stage residuals would be computed from the first stage fitted values rather than from the observed lot size.

``````In :
mlr2 <- ivreg(formula = price ~ lotsize + bedrooms | bedrooms + driveway + garage, data = HousePrices)
``````
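The stage by stage remark can be checked with a short sketch. The object names `iv_fit`, `stage1`, `stage2`, `hp` and `lotsize_hat` are hypothetical helpers introduced here for illustration; the model specification is the one used above.

```r
library(AER)
data("HousePrices")

# Reference fit: both stages handled jointly by ivreg
iv_fit <- ivreg(price ~ lotsize + bedrooms | bedrooms + driveway + garage,
                data = HousePrices)

# Stage 1: regress the endogenous variable on the exogenous variable and instruments
stage1 <- lm(lotsize ~ bedrooms + driveway + garage, data = HousePrices)

# Stage 2: replace lotsize by its first stage fitted values
hp <- transform(HousePrices, lotsize_hat = fitted(stage1))
stage2 <- lm(price ~ lotsize_hat + bedrooms, data = hp)

# Coefficients agree up to numerical tolerance
same_coef <- isTRUE(all.equal(unname(coef(stage2)), unname(coef(iv_fit)),
                              tolerance = 1e-6))

# Standard errors differ: stage2 residuals are based on lotsize_hat, not lotsize
se_compare <- cbind(manual = sqrt(diag(vcov(stage2))),
                    ivreg  = sqrt(diag(vcov(iv_fit))))
same_coef
se_compare
```

This is why the manual approach is useful for understanding the mechanics but should not be used for inference.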

Fifth, we get the `mlr1` model summary with the `summary` function and store it in the `smlr1` object, where parameter `object = mlr1` is the fitted model. Then, we get the `mlr2` model summary with the `summary` method for `ivreg` objects and store it in the `smlr2` object, where `object = mlr2` is the fitted model and `test = "F"` requests an F-test. Notice that `test = "F"` was only included as an educational example and can be modified according to your needs. Also, notice that the `mlr2` standard errors assume homoskedastic errors unless a heteroskedasticity consistent variance covariance matrix estimator is supplied to the `summary` method for `ivreg`.

``````In :
smlr1 <- summary(object = mlr1)
smlr2 <- summary(object = mlr2, test = "F")
``````
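As a hedged sketch of the heteroskedasticity consistent option mentioned above, the `vcov.` parameter of the `summary` method for `ivreg` accepts a covariance estimator function. Here the `sandwich` function from the `sandwich` package (attached together with `AER`) is used; choosing this particular estimator is an assumption made for illustration, and other estimators may suit your data better. The names `iv_fit`, `s_default` and `s_robust` are hypothetical.

```r
library(AER)  # also attaches the sandwich package
data("HousePrices")

iv_fit <- ivreg(price ~ lotsize + bedrooms | bedrooms + driveway + garage,
                data = HousePrices)

s_default <- summary(iv_fit)                    # homoskedastic standard errors
s_robust  <- summary(iv_fit, vcov. = sandwich)  # sandwich (robust) standard errors

# Estimates are unchanged; only the standard errors (and tests) differ
se_table <- cbind(default = s_default$coefficients[, "Std. Error"],
                  robust  = s_robust$coefficients[, "Std. Error"])
se_table
```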

Sixth, we print the `mlr1` estimated coefficients table from the `coefficients` element of `smlr1`.

``````In :
smlr1$coefficients
``````
``````Out :
                Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)  5612.599731 4102.8189131  1.367986 1.718822e-01
lotsize         6.053022    0.4243331 14.264788 1.938847e-39
bedrooms    10567.351501 1247.6764642  8.469625 2.314456e-16
``````

Seventh, we print the `mlr2` estimated coefficients table from the `coefficients` element of `smlr2`.

``````In :
smlr2$coefficients
``````
``````Out :
                Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -19130.15709   6540.667 -2.924802 3.590757e-03
lotsize         12.51948      1.240 10.096348 4.417073e-22
bedrooms      7680.12883   1574.086  4.879105 1.402506e-06
attr(,"df")
[1] 543
attr(,"nobs")
[1] 546
``````
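To compare the two sets of estimates directly, they can be placed side by side. This is a minimal sketch; the names `ols_fit`, `iv_fit` and `comparison` are hypothetical and the models are the same as above.

```r
library(AER)
data("HousePrices")

ols_fit <- lm(price ~ lotsize + bedrooms, data = HousePrices)
iv_fit  <- ivreg(price ~ lotsize + bedrooms | bedrooms + driveway + garage,
                 data = HousePrices)

# One row per coefficient, one column per estimator
comparison <- cbind(OLS = coef(ols_fit), TSLS = coef(iv_fit))
round(comparison, 3)
```

Consistent with the tables above, the lot size coefficient is larger under two stage least squares than under ordinary least squares.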

Eighth, we print the `mlr1` F-statistic from the `fstatistic` element of `smlr1`.

``````In :
smlr1$fstatistic
``````
``````Out :
   value    numdf    dendf
159.6367   2.0000 543.0000
``````

Ninth, we print the `mlr2` F-statistic from the `waldtest` element of `smlr2`.

``````In :
smlr2$waldtest
``````
``````Out :
[1] 9.151967e+01 5.590784e-35 2.000000e+00 5.430000e+02
``````
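The `waldtest` vector is unnamed, which makes it harder to read than the `fstatistic` output. The sketch below attaches labels, assuming the element order is statistic, p-value, numerator degrees of freedom and denominator degrees of freedom, as the printed values above suggest. It also computes the p-value of the `lm` F-statistic with `pf`, since `fstatistic` does not store it. The names `s_ols`, `s_iv`, `ols_p` and `wald` are hypothetical.

```r
library(AER)
data("HousePrices")

ols_fit <- lm(price ~ lotsize + bedrooms, data = HousePrices)
iv_fit  <- ivreg(price ~ lotsize + bedrooms | bedrooms + driveway + garage,
                 data = HousePrices)
s_ols <- summary(ols_fit)
s_iv  <- summary(iv_fit)

# p-value of the overall OLS F-test: upper tail of the F distribution
f <- s_ols$fstatistic
ols_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Label the ivreg Wald test elements (order assumed from the printed output)
wald <- setNames(s_iv$waldtest, c("statistic", "p.value", "numdf", "dendf"))

ols_p
wald
```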


Data Description: Sales prices of houses sold in the city of Windsor, Canada, during July, August and September, 1987.

Original Source: Anglin, P., and Gencay, R. (1996). Semiparametric Estimation of a Hedonic Price Function. Journal of Applied Econometrics, 11, 633–648.

AER R Package: Kleiber, C., and Zeileis, A. (2008). Applied Econometrics with R. Springer-Verlag, New York.
