Identification of multicollinearity and it’s effect in model selection

Electronic Journal of Applied Statistical Analysis
EJASA, Electron. J. App. Stat. Anal.
e-ISSN: 2070-5948
Vol. 7, Issue 1, 2014, 153-179
DOI: 10.1285/i20705948v7n1p153

Identification of Multicollinearity and it’s effect in Model selection
D.S. Jayakumar* and A. Sulthan
Jamal Institute of Management, Tiruchirappalli, India
April 26, 2014

This work is copyrighted by Università del Salento, and is licensed under a Creative Commons Attribuzione - Non commerciale - Non opere derivate 3.0 Italia License.

Multicollinearity is a problem that statisticians encounter when evaluating a sample regression model. This paper explores the relationship between the sample variance inflation factor (vif) and the F-ratio. Based on this relationship, we propose an exact F-test for the identification of multicollinearity, which overcomes the traditional rule-of-thumb procedures. The authors find that the variance inflation factor not only inflates the variance of the estimated regression coefficient but also inflates the residual error variance of a given fitted regression model, at various levels of inflation. Moreover, we find a link between the problem of multicollinearity and its impact on the model selection decision. For this, the authors propose a multicollinearity-corrected version of the generalized information criteria, which incorporates the effect of multicollinearity and helps statisticians select the best model among various competing models. The procedure is illustrated numerically by fitting 12 different types of stepwise regression models based on 44 independent variables in a BSQ (Bank Service Quality) study. Finally, the study results show the transition in model selection after correcting for the multicollinearity effect.
keywords: Multicollinearity, variance inflation factor, Error-variance, F-test, Generalized Information criteria, multicollinearity penalization.

* Corresponding author

1 Introduction and Related work

In the process of fitting a regression model, when one independent variable is nearly a combination of other independent variables, the parameter estimates are affected. This problem is called multicollinearity. Multicollinearity is not a violation of the assumptions of regression, but it may cause serious difficulties (Neter et al., 1989): (1) variances of parameter estimates may be unreasonably large, (2) parameter estimates may not be significant, (3) a parameter estimate may have a sign different from what is expected, and so on (Efron, 2004). To solve or alleviate this problem in a given regression model, the usual approach is to drop redundant variables from the model directly, that is, to avoid the problem by not including redundant variables in the regression model (Bowerman et al., 1993). But sometimes it is hard to decide which variables are redundant. An alternative to deleting variables is to perform a principal component analysis (Maddala, 1977). With principal component regression, we create a set of artificial uncorrelated variables that can then be used in the regression model. Although principal component variables are dropped from the model, transforming the model back will cause other biases too (Draper and Smith, 1981; Srivastava, 2002).

The transformation of the independent variables and the methods discussed above for overcoming the multicollinearity problem depend entirely on the exact identification of the problem. In this paper, the identification of the multicollinearity problem is discussed in a separate section, and the way it misleads statisticians when selecting a regression model based on information criteria is shown in the section after it.
2 Inflation of error variance

Consider an estimated sample regression model with a single dependent variable $y_i$ and $p$ regressors $x_{1i}, x_{2i}, x_{3i}, \ldots, x_{pi}$, given as

$$y_i = \hat{\alpha} + \sum_{j=1}^{p} \hat{\beta}_j x_{ji} + \hat{e}_i \tag{1}$$

where $\hat{\alpha}$ is the estimated intercept, $\hat{\beta}_j$ are the estimated beta coefficients or partial regression coefficients, and $\hat{e}_i$ is the estimated residual, which follows the normal distribution $N(0, \sigma_e^2)$.

From (1), the sample regression model should satisfy the assumptions of normality, homoscedasticity of the error variance, and serial independence. Even when the model satisfies all of these assumptions, it still has to be evaluated. The authors focus in particular on multicollinearity, which arises from strong inter-causal effects among the independent variables, and on its consequences. For the past five decades, statisticians have held that multicollinearity severely inflates the variance of the estimated regression coefficients, which creates greater instability and inconsistency in the estimated coefficients. Besides this, we identified a further problem caused by multicollinearity, which is derived mathematically below.

Consider the variance of the estimated regression coefficient,

$$\hat{\sigma}^2_{\hat{\beta}_j} = \frac{s_e^2}{(n-1)\, s^2_{x_j}} \left( \frac{1}{1 - R^2_{x_j}} \right) \tag{2}$$

where $s_e^2$ is the unbiased estimate of the error variance, $s^2_{x_j}$ is the variance of the independent variable $x_j$ ($j = 1, 2, 3, \ldots, p$), and $1/(1 - R^2_{x_j})$ is called the variance inflation factor (vif).
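Equation (2) can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes NumPy, simulated data, and an unbiased error variance computed with $n - p - 1$ degrees of freedom (an intercept plus $p$ slopes). The function name `coef_variances_two_ways` and all variable names are hypothetical. It compares the coefficient variances taken from the usual matrix formula with those built from the vif as in (2).

```python
import numpy as np

def coef_variances_two_ways(X, y):
    """Return (direct, via_vif) coefficient variances for each column of X.

    direct  : diagonal of s_e^2 (Z'Z)^{-1}, where Z = [1, X]
    via_vif : s_e^2 / ((n-1) s_xj^2) * 1 / (1 - R_xj^2), as in equation (2)
    """
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X])            # design matrix with intercept
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    s2_e = resid @ resid / (n - p - 1)              # assumption: df = n - p - 1
    direct = s2_e * np.diag(np.linalg.inv(Z.T @ Z))[1:]

    via_vif = np.empty(p)
    for j in range(p):
        # Auxiliary regression of x_j on the remaining regressors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        g, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        u = X[:, j] - others @ g
        sst_j = np.sum((X[:, j] - X[:, j].mean()) ** 2)   # equals (n-1) s_xj^2
        r2_j = 1.0 - (u @ u) / sst_j
        via_vif[j] = s2_e / sst_j * (1.0 / (1.0 - r2_j))
    return direct, via_vif

# Simulated data (illustrative only): x2 is strongly related to x1.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 - x2 + 0.5 * x3 + rng.normal(size=n)

direct, via_vif = coef_variances_two_ways(X, y)
print(direct)
print(via_vif)
```

The two arrays should agree up to floating-point error, which illustrates that the vif is the only factor in (2) that depends on the other regressors.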
The term $1 - R^2_{x_j}$ is the proportion of variation in the independent variable $x_j$ that is left unexplained by the independent variables other than $x_j$. Statisticians call this term the Tolerance, and the inverse of the Tolerance the VIF. Using the fact that $s_e^2 = \frac{n}{n-k}\, \hat{\sigma}_e^2$, rewrite (2) as

$$\hat{\sigma}^2_{\hat{\beta}_j} = \frac{n}{(n-1)(n-k)\, s^2_{x_j}} \left( \frac{\hat{\sigma}_e^2}{1 - R^2_{x_j}} \right) \tag{3}$$

$$\hat{\sigma}^2_{\hat{\beta}_j} = \frac{n}{(n-1)(n-k)\, s^2_{x_j}}\, \hat{\sigma}^2_{INF(e_j)} \tag{4}$$

From (3), the error variance $\hat{\sigma}_e^2$ of the given regression model plays a mediating role between the variance of the estimated regression coefficients $\hat{\sigma}^2_{\hat{\beta}_j}$ and the VIF. Instead of analyzing the inflation of the variance of the estimated regression coefficients, the authors focus only on the inflated part of the error variance due to multicollinearity, as in (4). In (4), $\hat{\sigma}^2_{INF(e_j)}$ is the error variance inflated by $(VIF)_j$, and it is equal to

$$\hat{\sigma}^2_{INF(e_j)} = \frac{\hat{\sigma}_e^2}{1 - R^2_{x_j}} \tag{5}$$

$$\hat{\sigma}^2_{INF(e_j)} = \frac{\sum_{i=1}^{n} \left( \hat{e}_i \big/ \sqrt{1 - R^2_{x_j}} \right)^2}{n} \tag{6}$$

$$\hat{\sigma}^2_{INF(e_j)} = \frac{\sum_{i=1}^{n} \left( \hat{e}_{INF(e_{ji})} \right)^2}{n} \tag{7}$$

From (5), the inflated error variance $\hat{\sigma}^2_{INF(e_j)}$ is always greater than or equal to the uninflated error variance $\hat{\sigma}_e^2$, that is, $\hat{\sigma}^2_{INF(e_j)} \geq \hat{\sigma}_e^2$. If $R^2_{x_j}$ is equal to 0, then both variances are equal and there is no multicollinearity. If $R^2_{x_j}$ is equal to 1, then the error variance is severely inflated and rises to $\infty$, which shows the existence of severe multicollinearity. If $0 < R^2_{x_j} < 1$, then there is a chance of inflation in the error variance of the regression model. Likewise, from (6) and (7), the estimated errors $\hat{e}_i$ are also inflated by $\sqrt{(VIF)_j}$ and are transformed into the estimated inflated residuals $\hat{e}_{INF(e_{ji})}$.
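The agreement between (5) and (6)-(7) can be illustrated with a short numerical sketch. It assumes NumPy and simulated data, and it takes $\hat{\sigma}_e^2$ to be the residual sum of squares divided by $n$, which is what the divisor in (6) and (7) implies. The function name `inflated_error_variance` is hypothetical and is not taken from the paper.

```python
import numpy as np

def inflated_error_variance(X, y, j):
    """Return (sigma2_e, sigma2_inf_eq5, sigma2_inf_eq6) for regressor j.

    sigma2_e       : residual sum of squares / n (uninflated error variance)
    sigma2_inf_eq5 : sigma2_e / (1 - R_xj^2), equation (5)
    sigma2_inf_eq6 : mean of squared inflated residuals, equations (6)-(7)
    """
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    e = y - Z @ beta                                 # estimated residuals
    sigma2_e = e @ e / n

    # Auxiliary regression of x_j on the other regressors gives R_xj^2.
    xj = X[:, j]
    others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    g, *_ = np.linalg.lstsq(others, xj, rcond=None)
    u = xj - others @ g
    r2_j = 1.0 - (u @ u) / np.sum((xj - xj.mean()) ** 2)

    e_inf = e / np.sqrt(1.0 - r2_j)                  # residuals scaled by sqrt(VIF_j)
    return sigma2_e, sigma2_e / (1.0 - r2_j), e_inf @ e_inf / n

# Simulated data (illustrative only): x2 is strongly related to x1.
rng = np.random.default_rng(1)
n = 150
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.4 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 0.5 + x1 + x2 + rng.normal(size=n)

sigma2_e, inf_eq5, inf_eq6 = inflated_error_variance(X, y, j=0)
print(sigma2_e, inf_eq5, inf_eq6)
```

Under these assumptions the two inflated values coincide, and both are at least as large as the uninflated error variance.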
If the estimated errors and the error variance are inflated, then the forecasting performance of the model declines, and this leads to inappropriate and illogical model selection decisions. From the above discussion, the authors showed that multicollinearity not only inflates the variance of the estimated regression coefficients but also inflates the error variance of the regression model. In order to test the statistical equality between the inflated error variance $\hat{\sigma}^2_{INF(e_j)}$ and the uninflated error variance $\hat{\sigma}_e^2$, the authors propose an F-test based on the link between the sample vif and the F-ratio. The methodology for applying the test statistic is discussed in the next section.

3 Testing the Inflation of error variance

To test the statistical equality between the sample $\hat{\sigma}_e^2$ and $\hat{\sigma}^2_{INF(e_j)}$, the authors first derive the test statistic by rewriting (5) as

$$\frac{1}{1 - R^2_{x_j}} = \frac{\hat{\sigma}^2_{INF(e_j)}}{\hat{\sigma}_e^2} \tag{8}$$

$$(vif)_j = \frac{\hat{\sigma}^2_{INF(e_j)}}{\hat{\sigma}_e^2} \tag{9}$$

Subtracting 1 from both sides of (8) and (9) gives

$$\frac{R^2_{x_j}}{1 - R^2_{x_j}} = \frac{\hat{\sigma}^2_{INF(e_j)}}{\hat{\sigma}_e^2} - 1 \tag{10}$$

$$\frac{R^2_{x_j}}{1 - R^2_{x_j}} = (vif)_j - 1 \tag{11}$$

Using the facts $(sst)_j = (ssr)_j + (sse)_j$, $R^2_{x_j} = (ssr)_j / (sst)_j$, and $1 - R^2_{x_j} = (sse)_j / (sst)_j$, rewrite (11) as

$$\frac{(ssr)_j}{(sse)_j} = (vif)_j - 1 \tag{12}$$

where ssr, sse, and sst refer to the sample sums of squares of regression, error, and total, respectively.
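Identity (12) can be illustrated with a small numerical sketch. It assumes NumPy and simulated regressors; the helper name `aux_sums_of_squares` is hypothetical. The auxiliary regression of $x_j$ on the remaining regressors (with an intercept) supplies $(ssr)_j$, $(sse)_j$ and $(sst)_j$, from which $(vif)_j = (sst)_j / (sse)_j$.

```python
import numpy as np

def aux_sums_of_squares(X, j):
    """Regress column j of X on the other columns (with intercept).

    Returns (ssr, sse, sst) of that auxiliary regression.
    """
    n = X.shape[0]
    xj = X[:, j]
    others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    g, *_ = np.linalg.lstsq(others, xj, rcond=None)
    fitted = others @ g
    ssr = np.sum((fitted - xj.mean()) ** 2)   # regression sum of squares
    sse = np.sum((xj - fitted) ** 2)          # error sum of squares
    sst = np.sum((xj - xj.mean()) ** 2)       # total sum of squares
    return ssr, sse, sst

# Simulated regressors (illustrative only): x3 depends on x1 and x2.
rng = np.random.default_rng(2)
n = 120
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.7 * x1 - 0.5 * x2 + 0.5 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

ssr, sse, sst = aux_sums_of_squares(X, j=2)
vif = sst / sse                               # 1 / (1 - R_xj^2)
print(ssr / sse, vif - 1.0)
```

Both printed quantities should match up to rounding, since $(ssr)_j / (sse)_j = ((sst)_j - (sse)_j) / (sse)_j$.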
Based on (12), the identity can be rewritten in terms of the sample mean squares of regression $s^2_{r_j}$ and error $s^2_{e_j}$ as

$$\frac{q\, s^2_{r_j}}{(n - q - 1)\, s^2_{e_j}} = (vif)_j - 1 \tag{13}$$

Multiplying both sides of (13) by the population mean square ratio $\sigma^2_{e_j} / \sigma^2_{r_j}$, we get

$$\frac{q\, s^2_{r_j} / \sigma^2_{r_j}}{(n - q - 1)\, s^2_{e_j} / \sigma^2_{e_j}} = \left( \frac{\sigma^2_{e_j}}{\sigma^2_{r_j}} \right) \left( (vif)_j - 1 \right) \tag{14}$$

From (14), the ratios $q\, s^2_{r_j} / \sigma^2_{r_j}$ and $(n - q - 1)\, s^2_{e_j} / \sigma^2_{e_j}$ follow chi-square distributions with $q$ and $n - q - 1$ degrees of freedom, respectively, and they are independent. Here $q$ is the number of independent variables in the auxiliary regression model $x_j = \alpha_{0j} + \sum_{k=1}^{q} \alpha_{jk} x_k + e_j$, $j \neq k$. The independence of the ratios is based on a least squares property: if $x_{ji} = \hat{x}_{ji} + e_{ji}$, then $\hat{x}_{ji}$ and $e_{ji}$ are independent and the respective sums of squares satisfy $(sst)_j = (ssr)_j + (sse)_j$. Without loss of generality, (14) can be written in terms of the F-ratio