Last Update: February 21, 2022
A linear regression analysis of variance (ANOVA) table can be produced in Python with the statsmodels package anova_lm function, found within the statsmodels.api.stats module. The table splits the total variance of the dependent variable into its two components: regression (explained) variance and residual (unexplained) variance. It is also used for evaluating whether adding independent variables improved a linear regression model. The main parameters of the anova_lm function are args, with the fitted results of the constant (intercept-only) linear regression and of the linear regression to be evaluated; test, with the test statistic to include; and typ, with the ANOVA test type.
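For reference, the decomposition described above can be written out as follows. The notation is an assumption made for this sketch and does not come from statsmodels: y_i is the observed dependent variable, \hat{y}_i its fitted value, \bar{y} its sample mean, n the number of observations and k the number of independent variables. The identity assumes the model includes a constant or intercept.

```latex
\begin{aligned}
\underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{\text{total}}
  &= \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{\text{regression (explained)}}
   + \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{residual (unexplained)}} \\
F &= \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 / k}
          {\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - k - 1)}
\end{aligned}
```

The F statistic compares the explained sum of squares per independent variable with the unexplained sum of squares per residual degree of freedom.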
As an example, we can print the ANOVA table from a multiple linear regression of house price explained by lot size and number of bedrooms, using the HousePrices data object included within the AER R package [1].
First, we import the statsmodels package for data downloading, multiple linear regression fitting and ANOVA table estimation [2].
In [1]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
Second, we create the houseprices data object using the get_rdataset function. To view its structure, we display the first five rows and first three columns of the data using the print function and the head data frame method.
In [2]:
houseprices = sm.datasets.get_rdataset(dataname="HousePrices", package="AER", cache=True).data
print(houseprices.iloc[:, 0:3].head())
Out [2]:
     price  lotsize  bedrooms
0  42000.0     5850         3
1  38500.0     4000         2
2  49500.0     3060         3
3  60500.0     6650         3
4  61000.0     6360         2
Third, we fit the multiple linear regression with the ols function using variables within the houseprices data object, store the results within the mlr object and print them using its summary method. Within the ols function, parameter formula="price ~ lotsize + bedrooms" fits a model where house price is explained by lot size and number of bedrooms.
In [3]:
mlr = smf.ols(formula="price ~ lotsize + bedrooms", data=houseprices).fit()
print(mlr.summary())
Out [3]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.370
Model:                            OLS   Adj. R-squared:                  0.368
Method:                 Least Squares   F-statistic:                     159.6
Date:                Mon, 08 Nov 2021   Prob (F-statistic):           2.95e-55
Time:                        19:08:52   Log-Likelihood:                -6213.1
No. Observations:                 546   AIC:                         1.243e+04
Df Residuals:                     543   BIC:                         1.245e+04
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   5612.5997   4102.819      1.368      0.172   -2446.741    1.37e+04
lotsize        6.0530      0.424     14.265      0.000       5.219       6.887
bedrooms    1.057e+04   1247.676      8.470      0.000    8116.488     1.3e+04
==============================================================================
Omnibus:                       77.789   Durbin-Watson:                   1.193
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              146.854
Skew:                           0.833   Prob(JB):                     1.29e-32
Kurtosis:                       4.919   Cond. No.                     2.60e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.6e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
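The three variance components can also be read directly from a fitted results object. The following is a minimal sketch that uses synthetic data, so that it runs without downloading anything. The demo data frame, its coefficients and the fit, ss_tot, ss_reg and ss_res names are hypothetical and only for illustration. The centered_tss, ess, ssr and rsquared attributes are the ones exposed by statsmodels OLS results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical values, not the HousePrices data).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "lotsize": rng.uniform(2000, 10000, n),
    "bedrooms": rng.integers(1, 6, n),
})
demo["price"] = (5000 + 6 * demo["lotsize"] + 10000 * demo["bedrooms"]
                 + rng.normal(0, 20000, n))

fit = smf.ols(formula="price ~ lotsize + bedrooms", data=demo).fit()

ss_tot = fit.centered_tss  # total sum of squares around the mean
ss_reg = fit.ess           # regression (explained) sum of squares
ss_res = fit.ssr           # residual (unexplained) sum of squares

# Total equals explained plus unexplained.
print(ss_tot, ss_reg + ss_res)
# R-squared is the explained share of the total.
print(fit.rsquared, ss_reg / ss_tot)
```

Note that in statsmodels ssr means sum of squared residuals, not regression sum of squares, which is stored as ess.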
Fourth, we fit the constant (intercept-only) linear regression using the ols function and store its results within the lr1 object. We then estimate the multiple linear regression ANOVA table using the anova_lm function, store its results within the anova object and print them. Within the ols function, parameter formula="price ~ 1" fits a linear regression with house price as dependent variable and only a constant, because the 1 term stands for the intercept, which is a column of ones. Within the anova_lm function, parameter test="F" does an F-test and typ="I" does an ANOVA Type I test. Notice that test="F" and typ="I" are also the default values and were only included as educational examples. The test parameter can be modified according to your needs, but when two or more fitted models are compared, as done here, anova_lm only supports typ="I".
In [4]:
lr1 = smf.ols(formula="price ~ 1", data=houseprices).fit()
anova = sm.stats.anova_lm(lr1, mlr, test="F", typ="I")
print(anova)
Out [4]:
   df_resid           ssr  df_diff       ss_diff           F        Pr(>F)
0     545.0  3.886028e+11      0.0           NaN         NaN           NaN
1     543.0  2.447151e+11      2.0  1.438877e+11  159.636705  2.954867e-55
   df_resid     ssr  df_diff  ss_diff       F  Pr(>F)
0    df_tot  ss_tot
1    df_res  ss_res   df_reg   ss_reg  f_stat  f_pval
Table 1. Analysis of Variance Table Output Description.
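As a rough check of how the cells in Table 1 relate to each other, the arithmetic can be redone by hand with the rounded values printed in Out [4]. This is only a sketch: the variable names follow Table 1, and r_squared is an extra name introduced here for illustration.

```python
import math

# Values copied from Out [4] (rounded as printed there).
ss_tot, df_tot = 3.886028e11, 545   # row 0: intercept-only model
ss_res, df_res = 2.447151e11, 543   # row 1: multiple linear regression
ss_reg, df_reg = 1.438877e11, 2     # row 1: difference between the two models

# Total sum of squares splits into residual plus regression parts.
print(math.isclose(ss_tot, ss_res + ss_reg, rel_tol=1e-6))
# Degrees of freedom split the same way.
print(df_tot == df_res + df_reg)

# F statistic: regression mean square over residual mean square.
f_stat = (ss_reg / df_reg) / (ss_res / df_res)
# Share of total variance explained by the regression.
r_squared = ss_reg / ss_tot

print(f_stat)     # close to the F value reported in Out [4]
print(r_squared)  # close to the R-squared reported in Out [3]
```

The small differences from the printed F and R-squared values come from using inputs rounded to seven significant digits.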
References
[1] Data Description: Sales prices of houses sold in the city of Windsor, Canada, during July, August and September, 1987.
Original Source: Anglin, P., and Gencay, R. (1996). Semiparametric Estimation of a Hedonic Price Function. Journal of Applied Econometrics, 11, 633–648.
[2] statsmodels Python package: Seabold, Skipper, and Josef Perktold. (2010). “statsmodels: Econometric and statistical modeling with python.” Proceedings of the 9th Python in Science Conference.