Last Update: February 21, 2022
Homogeneity of regression slopes with dummy variables can be tested in Python using the wald_test method of a fitted statsmodels regression results object, here obtained through the statsmodels.formula.api module. The test evaluates whether the linear regression intercept and slopes are homogeneous across populations.
As an example, we run a homogeneity Wald test on an unrestricted multiple linear regression of house prices explained by lot size, number of bedrooms and air conditioning as a dummy independent variable, using the HousePrices data object included in the AER R package [1].
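The example can be written out as an equation. This is only a sketch of the model described above: the symbols β, δ and D (the air conditioning dummy, 1 for yes and 0 for no) are notation assumed here and do not come from the original text.

```latex
\begin{aligned}
\text{price}_i &= \beta_0 + \beta_1\,\text{lotsize}_i + \beta_2\,\text{bedrooms}_i \\
&\quad + \delta_0\,D_i + \delta_1\,(\text{lotsize}_i \cdot D_i) + \delta_2\,(\text{bedrooms}_i \cdot D_i) + \varepsilon_i \\
H_0 &: \delta_0 = \delta_1 = \delta_2 = 0
\end{aligned}
```

Under the null hypothesis, houses with and without air conditioning share the same intercept and the same slopes. Under the alternative, houses with air conditioning have intercept β0 + δ0 and slopes β1 + δ1 and β2 + δ2.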
First, we import the statsmodels package for data downloading, multiple linear regression fitting and the Wald test [2].
In [1]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
Second, we create the houseprices data object using the get_rdataset function and display the first five rows of its first three columns and tenth column, using the print function and the head data frame method, to view its structure.
In [2]:
houseprices = sm.datasets.get_rdataset(dataname="HousePrices", package="AER", cache=True).data
print(houseprices.iloc[:, list(range(3)) + [9]].head())
Out [2]:
price lotsize bedrooms aircon
0 42000.0 5850 3 no
1 38500.0 4000 2 no
2 49500.0 3060 3 no
3 60500.0 6650 3 no
4 61000.0 6360 2 no
Third, as an example again, we fit the unrestricted multiple linear regression with the ols function using variables within the houseprices data object, store the results in the mlr object and print its summary using the summary method. Within the ols function, the parameter formula="price ~ lotsize + bedrooms + aircon + lotsize*aircon + bedrooms*aircon" fits the unrestricted model where house price is explained by lot size, number of bedrooms, the air conditioning dummy independent variable and the products of that dummy with lot size and number of bedrooms. Notice that the formula parameter can also be formula="price ~ lotsize*aircon + bedrooms*aircon" because the * operator automatically includes the lotsize, bedrooms and aircon individual independent variables together with their products, which the summary labels lotsize:aircon[T.yes] and bedrooms:aircon[T.yes]. Also, notice that the ols function automatically converts the aircon variable's yes category into the numeric value 1 and its no category into the numeric value 0. Additionally, notice that the aircon dummy independent variable was only included as an educational example which can be modified according to your needs.
In [3]:
mlr = smf.ols(formula="price ~ lotsize + bedrooms + aircon + lotsize*aircon + bedrooms*aircon", data=houseprices).fit()
print(mlr.summary())
Out [3]:
OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.478
Model: OLS Adj. R-squared: 0.474
Method: Least Squares F-statistic: 99.09
Date: Sat, 23 Oct 2021 Prob (F-statistic): 5.14e-74
Time: 13:33:49 Log-Likelihood: -6161.6
No. Observations: 546 AIC: 1.234e+04
Df Residuals: 540 BIC: 1.236e+04
Df Model: 5
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
Intercept 1.536e+04 4263.209 3.603 0.000 6985.598 2.37e+04
aircon[T.yes] -1.423e+04 9434.410 -1.509 0.132 -3.28e+04 4297.895
lotsize 4.6206 0.466 9.915 0.000 3.705 5.536
lotsize:aircon[T.yes] 2.4380 0.882 2.763 0.006 0.705 4.171
bedrooms 7709.3160 1326.284 5.813 0.000 5104.008 1.03e+04
bedrooms:aircon[T.yes] 6125.1574 2661.132 2.302 0.022 897.718 1.14e+04
==============================================================================
Omnibus: 81.680 Durbin-Watson: 1.431
Prob(Omnibus): 0.000 Jarque-Bera (JB): 182.492
Skew: 0.807 Prob(JB): 2.36e-40
Kurtosis: 5.328 Cond. No. 7.30e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.3e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
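The two claims made about the formula (that the shorter form gives the same model, and that yes is coded as 1) can be checked with a short sketch. It uses a synthetic data frame named demo rather than the HousePrices data, so every number in it is made up for illustration and the printed coefficients will not match the output above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the house price data (all values are invented).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "lotsize": rng.uniform(2000, 10000, n),
    "bedrooms": rng.integers(1, 6, n),
    "aircon": rng.choice(["no", "yes"], n),
})
is_yes = (demo["aircon"] == "yes").astype(float)
demo["price"] = (
    15000 + 4.5 * demo["lotsize"] + 7500 * demo["bedrooms"]
    + is_yes * (-14000 + 2.5 * demo["lotsize"] + 6000 * demo["bedrooms"])
    + rng.normal(0, 10000, n)
)

# Long and short versions of the unrestricted formula.
long_fit = smf.ols(
    "price ~ lotsize + bedrooms + aircon + lotsize*aircon + bedrooms*aircon",
    data=demo,
).fit()
short_fit = smf.ols("price ~ lotsize*aircon + bedrooms*aircon", data=demo).fit()

# Compare coefficient names and values after sorting by name.
same_terms = set(long_fit.params.index) == set(short_fit.params.index)
same_coefs = np.allclose(
    long_fit.params.sort_index(), short_fit.params.sort_index()
)

# Check that the aircon[T.yes] design column is 1 for "yes" and 0 for "no".
names = long_fit.model.exog_names
dummy_col = long_fit.model.exog[:, names.index("aircon[T.yes]")]
dummy_matches = np.array_equal(dummy_col, is_yes.to_numpy())

print(sorted(long_fit.params.index))
print(same_terms, same_coefs, dummy_matches)
```

Sorting by coefficient name before comparing avoids relying on the order in which the formula terms are arranged in the design matrix.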
Fourth, we run the Wald test using the wald_test method of the mlr results object, store the results in the waldtest object and print them. Within wald_test, the parameter r_matrix="aircon[T.yes] = lotsize:aircon[T.yes] = bedrooms:aircon[T.yes] = 0" is a string stating the joint null hypothesis that the coefficients of the air conditioning dummy independent variable and of its products with the lot size and bedrooms independent variables are all zero, and the parameter use_f=True requests an F-test. Notice that the unrestricted mlr model results and the wald_test parameter use_f=True were only included as educational examples which can be modified according to your needs.
In [4]:
waldtest = mlr.wald_test(r_matrix="aircon[T.yes] = lotsize:aircon[T.yes] = bedrooms:aircon[T.yes] = 0", use_f=True)
print(waldtest)
Out [4]:
<F test: F=array([[37.35040103]]), p=6.030790422224445e-22, df_denom=540, df_num=3>
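As a cross-check on what the Wald F-test measures, the sketch below compares it with an F-test between the unrestricted model and a restricted model that drops the dummy and its products. With the default nonrobust covariance the two statistics should coincide. The data frame demo is again synthetic and only illustrative, and the restricted formula is an assumption added here, not part of the original example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the house price data (all values are invented).
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    "lotsize": rng.uniform(2000, 10000, n),
    "bedrooms": rng.integers(1, 6, n),
    "aircon": rng.choice(["no", "yes"], n),
})
is_yes = (demo["aircon"] == "yes").astype(float)
demo["price"] = (
    15000 + 4.5 * demo["lotsize"] + 7500 * demo["bedrooms"]
    + is_yes * (-14000 + 2.5 * demo["lotsize"] + 6000 * demo["bedrooms"])
    + rng.normal(0, 10000, n)
)

# Unrestricted model: intercept and slopes may differ by air conditioning.
unrestricted = smf.ols(
    "price ~ lotsize*aircon + bedrooms*aircon", data=demo
).fit()
# Restricted model: one intercept and one set of slopes for all houses.
restricted = smf.ols("price ~ lotsize + bedrooms", data=demo).fit()

# Wald F-test of the three joint restrictions on the unrestricted model.
wald = unrestricted.wald_test(
    "aircon[T.yes] = lotsize:aircon[T.yes] = bedrooms:aircon[T.yes] = 0",
    use_f=True,
)
# np.squeeze handles both array-valued and scalar statistics.
wald_f = float(np.squeeze(wald.statistic))
wald_p = float(np.squeeze(wald.pvalue))

# F-test comparing the nested models directly.
f_nested, p_nested, df_diff = unrestricted.compare_f_test(restricted)

print(wald_f, f_nested, df_diff)
```

The number of restrictions reported by compare_f_test corresponds to df_num in the Wald test output, which is 3 in the example above.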
References
[1] Data Description: Sales prices of houses sold in the city of Windsor, Canada, during July, August and September, 1987.
Original Source: Anglin, P., and Gencay, R. (1996). Semiparametric Estimation of a Hedonic Price Function. Journal of Applied Econometrics, 11, 633–648.
[2] statsmodels Python package: Seabold, Skipper, and Josef Perktold. (2010). “statsmodels: Econometric and statistical modeling with python.” Proceedings of the 9th Python in Science Conference.