# Multicollinearity: Variance Inflation Factor in Python

Last Update: February 21, 2022

Multicollinearity in Python can be tested with the `variance_inflation_factor` function from the `statsmodels.stats.outliers_influence` module of the `statsmodels` package, which estimates the variance inflation factor of one independent variable of a multiple linear regression at a time. Its main parameters are `exog`, the matrix of independent variable values, and `exog_idx`, the column index of the independent variable whose variance inflation factor is estimated. Variance inflation factors can also be read from the main diagonal of the inverse of the independent variables' correlation matrix, computed with the `inv` function from the `numpy.linalg` module of the `numpy` package. The main parameter of `inv` is `a`, the matrix to be inverted.
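As a reminder of why the two routes agree, the variance inflation factor of the j-th independent variable is usually written as below. The symbols $R_j^2$ (the coefficient of determination from regressing that variable on the remaining independent variables plus an intercept) and $\mathbf{R}$ (the correlation matrix of the independent variables) are notation assumed here, not taken from the text above.

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2} = \left[\mathbf{R}^{-1}\right]_{jj}
$$

A value of 1 means the variable is linearly unrelated to the other independent variables, and larger values indicate stronger linear dependence.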

As an example, we test multicollinearity among the independent variables of a multiple linear regression of house price explained by lot size and number of bedrooms, bathrooms and stories, using the data in the `HousePrices` object of the `AER` R package.

First, we import `numpy` for estimating the inverse correlation matrix, `pandas` for creating data frame objects, `seaborn` for the inverse correlation matrix chart, and `statsmodels` for downloading data, adding a constant column to the independent variables data frame, and estimating variance inflation factors individually.

``````In :
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.tools.tools as smt
import statsmodels.stats.outliers_influence as smo
``````

Second, we create the `houseprices` data object using the `get_rdataset` function and display the first five rows and five columns of data using the `print` function and the `head` data frame method to view its structure.

``````In :
houseprices = sm.datasets.get_rdataset(dataname="HousePrices", package="AER", cache=True).data
# Show the first five rows of the first five columns
print(houseprices.iloc[:, 0:5].head())
``````
``````Out :
     price  lotsize  bedrooms  bathrooms  stories
0  42000.0     5850         3          1        2
1  38500.0     4000         2          1        1
2  49500.0     3060         3          1        1
3  60500.0     6650         3          1        2
4  61000.0     6360         2          1        1
``````

Third, we create a data frame subset of the independent variables, store it in the `ivar` object, and display its first five rows using the `print` function and the `head` data frame method to view its structure.

``````In :
# Columns 1 to 4: lotsize, bedrooms, bathrooms, stories
ivar = houseprices.iloc[:, 1:5]
print(ivar.head())
``````
``````Out :
   lotsize  bedrooms  bathrooms  stories
0     5850         3          1        2
1     4000         2          1        1
2     3060         3          1        1
3     6650         3          1        2
4     6360         2          1        1
``````

Fourth, we add a constant column to the independent variables data frame using the `add_constant` function and store the outcome in the `ivarc` object. Within `add_constant`, the parameter `data=ivar` is the matrix of independent variables and `prepend=False` is a logical value that places the constant in the last column of the matrix. Next, we print the estimated variance inflation factor of an individual independent variable using the `variance_inflation_factor` function. Within `variance_inflation_factor`, the parameter `exog=ivarc.values` is the matrix of independent variable values and `exog_idx=0` is the column index of the `lotsize` independent variable, whose variance inflation factor is estimated as an example.

``````In :
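# Append the constant as the last column so lotsize stays at index 0
ivarc = smt.add_constant(data=ivar, prepend=False)
vif_lotsize = smo.variance_inflation_factor(exog=ivarc.values, exog_idx=0)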
print(vif_lotsize)
``````
``````Out :
1.047054041442195
``````

Fifth, we print the estimated variance inflation factors of all independent variables as the main diagonal values of their inverse correlation matrix, computed with the `inv` function and stored in the `ivaricor` object as a `DataFrame`. Within `inv`, the parameter `a = ivar.corr()` is the correlation matrix of the independent variables, estimated with the `corr` data frame method. Within `DataFrame`, the parameter `data=ivaricor` is the previously estimated inverse correlation matrix, `index=ivar.columns` sets the row names and `columns=ivar.columns` sets the column names.

``````In :
ivaricor = np.linalg.inv(a = ivar.corr())
ivaricor = pd.DataFrame(data=ivaricor, index=ivar.columns, columns=ivar.columns)
print(ivaricor)
``````
``````Out :
            lotsize  bedrooms  bathrooms   stories
lotsize    1.047054 -0.099092  -0.168300  0.007355
bedrooms  -0.099092  1.310851  -0.335344 -0.417828
bathrooms -0.168300 -0.335344   1.239203 -0.250689
stories    0.007355 -0.417828  -0.250689  1.251087
``````
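The main diagonal above matches the value returned earlier for `lotsize` (about 1.047). As a rough check of why, the sketch below compares the diagonal of an inverse correlation matrix with variance inflation factors computed from auxiliary regressions. It uses synthetic data and only `numpy`; the helper `vif_aux` and the variables `X`, `vif_inv` and `vif_reg` are hypothetical names introduced for illustration.

```python
import numpy as np

# Synthetic stand-in data (hypothetical, not the HousePrices values).
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)
x3 = 0.3 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Route 1: main diagonal of the inverse correlation matrix.
vif_inv = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

def vif_aux(X, j):
    """VIF of column j from regressing it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([others, np.ones(len(y))])  # intercept as last column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Route 2: one auxiliary regression per column.
vif_reg = np.array([vif_aux(X, j) for j in range(X.shape[1])])

print(vif_inv)
print(vif_reg)
```

The two printed arrays should agree up to floating-point error, which is the relationship the inverse correlation matrix approach relies on.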

Sixth, we can additionally visualize the estimated variance inflation factors as the main diagonal values of an inverse correlation matrix chart using the `heatmap` function from the `seaborn` package. Within `heatmap`, the parameter `data=ivaricor` is the matrix to visualize, `cmap="Blues"` is the name of a `matplotlib` colormap and `annot=True` is a logical value that writes the data value in each cell.

``````In :
sns.heatmap(data=ivaricor, cmap="Blues", annot=True)
``````
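``Out :`` [Figure: heatmap of the inverse correlation matrix, with variance inflation factors on the main diagonal]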


Data Description: Sales prices of houses sold in the city of Windsor, Canada, during July, August and September, 1987.

Original Source: Anglin, P., and Gencay, R. (1996). Semiparametric Estimation of a Hedonic Price Function. Journal of Applied Econometrics, 11, 633–648.

numpy Python package: Harris, C. R., et al. (2020). Array programming with NumPy. Nature, 585, 357–362.

pandas Python package: McKinney, W. (2010). Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, 51–56.

seaborn Python package: Waskom, M. L. (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60), 3021.

statsmodels Python package: Seabold, S., and Perktold, J. (2010). statsmodels: Econometric and statistical modeling with Python. Proceedings of the 9th Python in Science Conference.
