

Electronic Journal of Applied Statistical Analysis
EJASA, Electron. J. App. Stat. Anal.
http://siba-ese.unisalento.it/index.php/ejasa/index
e-ISSN: 2070-5948
DOI: 10.1285/i20705948v7n1p153

Identification of Multicollinearity and its effect in Model selection
By Jayakumar and Sulthan
April 26, 2014

This work is copyrighted by Università del Salento, and is licensed under a Creative Commons Attribuzione - Non commerciale - Non opere derivate 3.0 Italia License. For more information see: http://creativecommons.org/licenses/by-nc-nd/3.0/it/

Electronic Journal of Applied Statistical Analysis
Vol. 7, Issue 1, 2014, 153-179
DOI: 10.1285/i20705948v7n1p153
Identification of Multicollinearity and its effect in Model selection

D.S. Jayakumar* and A. Sulthan

Jamal Institute of Management, Tiruchirappalli, India

April 26, 2014
Multicollinearity is a problem statisticians face while evaluating a sample regression model. This paper explores the relationship between the sample variance inflation factor (vif) and the F-ratio, and on this basis proposes an exact F-test for the identification of multicollinearity that overcomes the traditional rule-of-thumb procedures. The authors show that the variance inflation factor not only inflates the variance of the estimated regression coefficients but also inflates the residual error variance of a fitted regression model, at various levels of inflation. Moreover, we establish a link between the problem of multicollinearity and its impact on the model selection decision. To this end, the authors propose a multicollinearity-corrected version of the generalized information criteria, which incorporates the effect of multicollinearity and helps statisticians select the best model among several competing models. The procedure is numerically illustrated by fitting 12 different stepwise regression models based on 44 independent variables in a BSQ (Bank Service Quality) study. Finally, the results show the transition in model selection after correcting for the multicollinearity effect.
keywords: Multicollinearity, variance inflation factor, error variance, F-test, generalized information criteria, multicollinearity penalization.
1 Introduction and Related work
In the process of fitting a regression model, when one independent variable is nearly a linear combination of the other independent variables, the parameter estimates are affected.

* Corresponding author: samjaya77@gmail.com

This problem is called multicollinearity. Basically, multicollinearity is not a violation of the assumptions of regression, but it may cause serious difficulties (Neter et al., 1989): (1) variances of parameter estimates may be unreasonably large, (2) parameter estimates may not be significant, (3) a parameter estimate may have a sign different from what is expected, and so on (Efron, 2004). For solving or alleviating this problem in a given regression model, the usual best remedy is to drop the redundant variables from the model directly, that is, to avoid the problem by not including redundant variables in the regression model (Bowerman et al., 1993). But sometimes it is hard to decide which variables are redundant. An alternative to deleting variables is to perform a principal component analysis (Maddala, 1977). With principal component regression, we create a set of artificial uncorrelated variables that can then be used in the regression model. Although some principal component variables are dropped from the model, when the model is transformed back, other biases are introduced (Draper and Smith, 1981; Srivastava, 2002). The transformations of the independent variables and the methods for overcoming the multicollinearity problem discussed above depend entirely on exact identification of the problem. In this paper, the identification of the effect of the multicollinearity problem is discussed in a separate section, and how it misleads statisticians into selecting a regression model based on information criteria is shown in the section after that.
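The principal component regression workaround mentioned above can be sketched as follows. This is an illustrative outline on simulated data, not the authors' procedure; all variable names, the simulated design, and the choice to keep two components are our assumptions:

```python
import numpy as np

# Illustrative principal component regression: replace correlated regressors
# with uncorrelated principal components, regress on the leading components,
# then map the coefficients back to the original variables.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)        # nearly redundant copy of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 - 1.0 * x3 + rng.normal(size=n)

Xc = X - X.mean(axis=0)                     # center the regressors
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2                                       # keep 2 components, drop the smallest
Z = Xc @ Vt[:k].T                           # uncorrelated component scores
gamma = np.linalg.lstsq(
    np.column_stack([np.ones(n), Z]), y, rcond=None)[0]
beta_pcr = Vt[:k].T @ gamma[1:]             # back-transform to original scale
print(beta_pcr)  # the collinear pair x1, x2 shares x1's effect; x3 keeps its own
```

Dropping the smallest component removes the unstable x1 − x2 contrast, which is exactly the bias-for-variance trade the text warns about: the transformed-back coefficients are biased whenever the dropped component carries part of the true effect.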
2 Inﬂation of error variance
Consider an estimated sample regression model with a single dependent variable ($y_i$) and $p$ regressors $x_{1i}, x_{2i}, x_{3i}, \ldots, x_{pi}$, given as

$$y_i = \alpha + \sum_{j=1}^{p} \beta_j x_{ji} + e_i \quad (1)$$

where $\alpha$ is the estimated intercept, the $\beta_j$ are the estimated beta coefficients (partial regression coefficients), and $e_i$ is the estimated residual, assumed to follow the normal distribution $N(0, \sigma^2_e)$. From (1), the sample regression model should satisfy the assumptions of normality, homoscedasticity of the error variance, and serial independence. Even when the model satisfies all these assumptions, it still has to be evaluated. The authors focus in particular on multicollinearity and its effects, which arise from strong inter-causal relationships among the independent variables. For the past five decades, statisticians have held that the impact of the multicollinearity problem severely inflates the variance of the estimated regression coefficients, creating greater instability and inconsistency in the estimated coefficients. Beyond this, we identified a further remarkable problem due to multicollinearity, which is derived mathematically below. Consider the variance of an estimated regression coefficient:

$$\sigma^2_{\beta_j} = \frac{s^2_e}{(n-1)\,s^2_{x_j}} \left( \frac{1}{1 - R^2_{x_j}} \right) \quad (2)$$
where $s^2_e$ is the unbiased estimate of the error variance, $s^2_{x_j}$ is the variance of the independent variable $x_j$ ($j = 1, 2, 3, \ldots, p$), and $1/(1 - R^2_{x_j})$ is technically called the variance inflation factor (vif). The term $1 - R^2_{x_j}$ is the variation in $x_j$ left unexplained by the independent variables other than $x_j$. More specifically, statisticians call this term the Tolerance, and the inverse of the Tolerance is the VIF. Using the fact that $s^2_e = \big(n/(n-k)\big)\,\sigma^2_e$, rewrite (2) as
$$\sigma^2_{\beta_j} = \frac{n}{(n-1)(n-k)\,s^2_{x_j}} \left( \frac{\sigma^2_e}{1 - R^2_{x_j}} \right) \quad (3)$$

$$\sigma^2_{\beta_j} = \frac{n}{(n-1)(n-k)\,s^2_{x_j}}\, \sigma^2_{INF(e_j)} \quad (4)$$

From (3), the error variance ($\sigma^2_e$) of the given regression model plays a mediating role between the variance of the estimated regression coefficients ($\sigma^2_{\beta_j}$) and the VIF. Instead of analyzing the inflation of the variance of the estimated regression coefficients ($\sigma^2_{\beta_j}$), the authors focus only on the part of the error variance inflated by the impact of multicollinearity, as in (4). From (4), $\sigma^2_{INF(e_j)}$ is the inflated error variance, inflated by $(VIF)_j$, and is equal to
$$\sigma^2_{INF(e_j)} = \frac{\sigma^2_e}{1 - R^2_{x_j}} \quad (5)$$

$$\sigma^2_{INF(e_j)} = \frac{\sum_{i=1}^{n} \left( e_i \big/ \sqrt{1 - R^2_{x_j}} \right)^2}{n} \quad (6)$$

$$\sigma^2_{INF(e_j)} = \frac{\sum_{i=1}^{n} \left( e_{INF(e_{ji})} \right)^2}{n} \quad (7)$$

From (5), the inflated error variance $\sigma^2_{INF(e_j)}$ is always greater than or equal to the uninflated error variance $\sigma^2_e$, that is, $\sigma^2_{INF(e_j)} \geq \sigma^2_e$. If $R^2_{x_j}$ is equal to 0, the two variances are equal and there is no multicollinearity. Similarly, if $R^2_{x_j}$ is equal to 1, the error variance is severely inflated and rises to $\infty$, which shows the existence of severe multicollinearity. In the same manner, if $0 < R^2_{x_j} < 1$, there is some degree of inflation in the error variance of the regression model. Likewise, from (6) and (7), the estimated errors $e_i$ are also inflated by $(VIF)_j$ and are transformed into the estimated inflated residuals $e_{INF(e_{ji})}$. If the estimated errors and the error variance are inflated, the forecasting performance of the model declines, and this leads to inappropriate and illogical model selection decisions. From the discussion above, the authors have shown that the problem of multicollinearity not only inflates the variance of the estimated regression coefficients but also inflates the error variance of the regression model. In order to test the statistical equality of the inflated error variance $\sigma^2_{INF(e_j)}$ and the uninflated error variance $\sigma^2_e$, the authors propose an F-test based on the link between the sample vif and the F-ratio. The methodology for applying the test statistic is discussed in the next section.
3 Testing the Inflation of error variance
For the purpose of testing the statistical equality between the sample $\sigma^2_e$ and $\sigma^2_{INF(e_j)}$, the authors first derived the test statistic by rewriting (5) as the basis:

$$\frac{1}{1 - R^2_{x_j}} = \frac{\sigma^2_{INF(e_j)}}{\sigma^2_e} \quad (8)$$

$$(vif)_j = \frac{\sigma^2_{INF(e_j)}}{\sigma^2_e} \quad (9)$$

From (8) and (9), this can be modified as

$$\frac{R^2_{x_j}}{1 - R^2_{x_j}} = \frac{\sigma^2_{INF(e_j)}}{\sigma^2_e} - 1 \quad (10)$$

$$\frac{R^2_{x_j}}{1 - R^2_{x_j}} = (vif)_j - 1 \quad (11)$$

From (11), using the facts $(sst)_j = (ssr)_j + (sse)_j$, $R^2_{x_j} = (ssr)_j/(sst)_j$ and $1 - R^2_{x_j} = (sse)_j/(sst)_j$, rewrite (11) as

$$\frac{(ssr)_j}{(sse)_j} = (vif)_j - 1 \quad (12)$$

where $ssr$, $sse$ and $sst$ refer to the sample sums of squares of the regression, the error and the total, respectively. Based on (12), this can be rewritten in terms of the sample mean squares of regression ($s^2_r$) and error ($s^2_e$) as

$$\frac{q\,s^2_{r_j}}{(n-q-1)\,s^2_{e_j}} = (vif)_j - 1 \quad (13)$$

From (13), multiplying both sides by the ratio of population mean squares $\sigma^2_{e_j}/\sigma^2_{r_j}$, we get

$$\frac{q\,s^2_{r_j}\big/\sigma^2_{r_j}}{(n-q-1)\,s^2_{e_j}\big/\sigma^2_{e_j}} = \left(\frac{\sigma^2_{e_j}}{\sigma^2_{r_j}}\right)\left((vif)_j - 1\right) \quad (14)$$

From (14), the ratios $q\,s^2_{r_j}/\sigma^2_{r_j}$ and $(n-q-1)\,s^2_{e_j}/\sigma^2_{e_j}$ follow chi-square distributions with $q$ and $n-q-1$ degrees of freedom (where $q$ is the number of independent variables in the auxiliary regression model $x_j = \alpha_{0j} + \sum_{k=1}^{q} \alpha_{jk} x_k + e_j$, $j \neq k$), and they are independent. The independence of the ratios follows from the least-squares property: if $x_{ji} = \hat{x}_{ji} + e_{ji}$, then $\hat{x}_{ji}$ and $e_{ji}$ are independent and the respective sums of squares satisfy $(sst)_j = (ssr)_j + (sse)_j$. Without loss of generality, (14) can be written in terms of the F-ratio
