Last Update: February 21, 2022
The coefficient of determination can be estimated in Python with the statsmodels package. The ols function from the statsmodels.formula.api module fits a linear regression, the summary method of the fitted model prints its results, and the rsquared and rsquared_adj properties of the fitted model return the estimated coefficient of determination and its adjusted value. The main parameters of the ols function are formula, a model description string of the form "y ~ x1 + … + xp", and data, a data frame object containing the model variables.
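For orientation, the quantity being estimated can be computed directly from its definition, R² = 1 − SS_res / SS_tot, with the adjusted value given by 1 − (1 − R²)(n − 1)/(n − p − 1) for n observations and p predictors. The following is a minimal sketch, assuming NumPy is available; the toy arrays x1, x2 and y are made up for illustration and are not the HousePrices data.

```python
import numpy as np

# Hypothetical toy data (not the HousePrices data): y explained by x1 and x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares coefficients.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual and total sums of squares.
residuals = y - X @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

# n observations, p predictors (intercept excluded).
n, p = X.shape[0], X.shape[1] - 1
r2 = 1.0 - ss_res / ss_tot
r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
print(r2, r2_adj)
```

The adjusted value is never larger than the unadjusted one, because it penalizes each additional predictor.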
As an example, we estimate the coefficient of multiple determination from a multiple linear regression of house price on lot size and number of bedrooms, using the HousePrices data object included in the AER R package [1].
First, we import the statsmodels package for data downloading and model fitting [2].
In [1]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
Second, we create the houseprices data object using the get_rdataset function and display the first five rows of its first three columns using the print function and the head data frame method to view its structure.
In [2]:
houseprices = sm.datasets.get_rdataset(dataname="HousePrices", package="AER", cache=True).data
print(houseprices.iloc[:, 0:3].head())
Out [2]:
     price  lotsize  bedrooms
0  42000.0     5850         3
1  38500.0     4000         2
2  49500.0     3060         3
3  60500.0     6650         3
4  61000.0     6360         2
Third, we fit the model with the ols function using variables from the houseprices data object and store the result in the mlr object. Within the ols function, the parameter formula="price ~ lotsize + bedrooms" fits a model where house price is explained by lot size and number of bedrooms.
In [3]:
mlr = smf.ols(formula="price ~ lotsize + bedrooms", data=houseprices).fit()
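As a rough illustration of what the fitted results object holds, the sketch below fits the same kind of formula model and recomputes both coefficients from the sums of squares stored on the results. It assumes NumPy, pandas and statsmodels are installed; the toy data frame and its column names y, x1 and x2 are made up so the example runs without downloading anything.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data standing in for a real dataset (hypothetical column names).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
toy["y"] = 1.0 + 2.0 * toy["x1"] - 0.5 * toy["x2"] + rng.normal(size=n)

# Same pattern as the article: formula string plus data frame, then fit().
fit = smf.ols(formula="y ~ x1 + x2", data=toy).fit()

# Residual and centered total sums of squares are stored on the results.
r2_manual = 1.0 - fit.ssr / fit.centered_tss
# The adjusted value rescales by the residual degrees of freedom.
r2_adj_manual = 1.0 - (1.0 - r2_manual) * (fit.nobs - 1) / fit.df_resid

print(fit.rsquared, r2_manual)
print(fit.rsquared_adj, r2_adj_manual)
```

Each printed pair should agree up to floating-point rounding for a model that includes an intercept, as formula models do by default.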
Fourth, we print the mlr model summary, which includes the estimated coefficient of multiple determination and its adjusted value, using its summary method.
In [4]:
print(mlr.summary())
Out [4]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.370
Model:                            OLS   Adj. R-squared:                  0.368
Method:                 Least Squares   F-statistic:                     159.6
Date:                Wed, 25 Aug 2021   Prob (F-statistic):           2.95e-55
Time:                        18:41:02   Log-Likelihood:                -6213.1
No. Observations:                 546   AIC:                         1.243e+04
Df Residuals:                     543   BIC:                         1.245e+04
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   5612.5997   4102.819      1.368      0.172   -2446.741    1.37e+04
lotsize        6.0530      0.424     14.265      0.000       5.219       6.887
bedrooms    1.057e+04   1247.676      8.470      0.000    8116.488     1.3e+04
==============================================================================
Omnibus:                       77.789   Durbin-Watson:                   1.193
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              146.854
Skew:                           0.833   Prob(JB):                     1.29e-32
Kurtosis:                       4.919   Cond. No.                     2.60e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.6e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Fifth, we can also print the estimated coefficient of determination and its adjusted value for the mlr model using its rsquared and rsquared_adj properties.
In [5]:
print(mlr.rsquared)
Out [5]:
0.37026934405815837
In [6]:
print(mlr.rsquared_adj)
Out [6]:
0.3679498941283542
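The two printed values are linked by the usual adjustment formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). As a quick check, the sketch below plugs in the figures reported above: the R² from Out [5], n = 546 from "No. Observations" and p = 2 from "Df Model". The variable names are chosen here for illustration only.

```python
# Values copied from the outputs shown above.
r2 = 0.37026934405815837  # mlr.rsquared, Out [5]
n = 546                   # No. Observations in the summary
p = 2                     # Df Model: lotsize and bedrooms

# Adjusted R-squared penalizes R-squared for the number of predictors.
r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
print(r2_adj)
```

The result should match the value in Out [6] up to floating-point rounding.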
References
[1] Data Description: Sales prices of houses sold in the city of Windsor, Canada, during July, August and September, 1987.
Original Source: Anglin, P., and Gencay, R. (1996). Semiparametric Estimation of a Hedonic Price Function. Journal of Applied Econometrics, 11, 633–648.
[2] Seabold, Skipper, and Josef Perktold. (2010). “statsmodels: Econometric and statistical modeling with python.” Proceedings of the 9th Python in Science Conference.