In statistics, multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.

In the case of perfect multicollinearity the predictor matrix is singular and therefore cannot be inverted. Under these circumstances, for a general linear model <math>y = X\beta + \varepsilon</math>, the ordinary least squares estimator <math>\hat{\beta}_{OLS} = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y</math> does not exist.

Note that in statements of the assumptions underlying regression analyses such as ordinary least squares, the phrase "no multicollinearity" is sometimes used to mean the absence of perfect multicollinearity, which is an exact (non-stochastic) linear relation among the regressors.

==Definition==
Collinearity is a linear association between ''two'' explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them. For example, <math>X_1</math> and <math>X_2</math> are perfectly collinear if there exist parameters <math>\lambda_0</math> and <math>\lambda_1</math> such that, for all observations ''i'', we have

: <math>X_{2i} = \lambda_0 + \lambda_1 X_{1i}.</math>

Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. We have perfect multicollinearity if, for example as in the equation above, the correlation between two independent variables is equal to 1 or −1. In practice, we rarely face perfect multicollinearity in a data set. More commonly, the issue of multicollinearity arises when there is an approximate linear relationship among two or more independent variables.

Mathematically, a set of variables is perfectly multicollinear if there exist one or more exact linear relationships among some of the variables. For example, we may have

: <math>\lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} = 0</math>

holding for all observations ''i'', where <math>\lambda_j</math> are constants and <math>X_{ji}</math> is the ''i''th observation on the ''j''th explanatory variable. We can explore one issue caused by multicollinearity by examining the process of attempting to obtain estimates for the parameters of the multiple regression equation

: <math>Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i.</math>

The ordinary least squares estimates involve inverting the matrix <math>X^\mathsf{T}X</math>, where

: <math>X = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ 1 & X_{12} & \cdots & X_{k2} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1N} & \cdots & X_{kN} \end{bmatrix}</math>

is an ''N'' × (''k''+1) matrix of observations. If there is an exact linear relationship (perfect multicollinearity) among the independent variables, the rank of <math>X</math> (and therefore of <math>X^\mathsf{T}X</math>) is less than ''k''+1, and the matrix <math>X^\mathsf{T}X</math> will not be invertible.

Perfect multicollinearity is fairly common when working with raw datasets, which frequently contain redundant information. Once redundancies are identified and removed, however, nearly multicollinear variables often remain due to correlations inherent in the system being studied. In such a case, instead of the above equation holding, we have that equation in modified form with an error term <math>v_i</math>:

: <math>\lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} + v_i = 0.</math>

In this case, there is no exact linear relationship among the variables, but the <math>X_j</math> variables are nearly perfectly multicollinear if the variance of <math>v_i</math> is small for some set of values for the <math>\lambda</math>'s.
The matrix <math>X^\mathsf{T}X</math> then has an inverse, but it is ill-conditioned, so a given computer algorithm may or may not be able to compute an approximate inverse; even if it can, the resulting computed inverse may be highly sensitive to slight variations in the data (because of magnified effects of rounding error) and so may be very inaccurate.
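The singular case described above can be checked numerically. The following sketch uses Python with NumPy on purely illustrative data (the particular values <math>\lambda_0 = 3</math> and <math>\lambda_1 = 2</math> are assumptions made for the example, not anything prescribed by the theory): the third column of the design matrix is an exact linear function of the second, so <math>X^\mathsf{T}X</math> is rank-deficient.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative (assumed) data: N = 5 observations, two predictors,
# where x2 is an exact linear function of x1 (perfect collinearity):
# x2 = 3 + 2 * x1, i.e. lambda_0 = 3, lambda_1 = 2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 3.0 + 2.0 * x1

# Design matrix X with an intercept column, as in the equation above.
X = np.column_stack([np.ones_like(x1), x1, x2])
XtX = X.T @ X

# The rank of X (and of X^T X) is 2 rather than k + 1 = 3,
# so X^T X is singular and OLS has no unique solution.
print(np.linalg.matrix_rank(X))              # 2
print(np.linalg.matrix_rank(XtX))            # 2
print(np.isclose(np.linalg.det(XtX), 0.0))   # True

# Depending on rounding, inverting X^T X either fails outright or
# silently returns numerically meaningless values.
try:
    inv = np.linalg.inv(XtX)
    print("inverse 'computed', condition number:", np.linalg.cond(XtX))
except np.linalg.LinAlgError as err:
    print("cannot invert X^T X:", err)
</syntaxhighlight>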
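The nearly multicollinear case can be illustrated in the same way. In the sketch below (again Python with NumPy; the data-generating values and noise scales are illustrative assumptions), the second predictor is an almost exact linear function of the first, so <math>X^\mathsf{T}X</math> is invertible but badly ill-conditioned; re-estimating after a tiny perturbation of the response moves the individual coefficients substantially while the fitted values barely change, consistent with multicollinearity harming inference about individual predictors rather than the fit of the model as a whole.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) data: x2 is *nearly* a linear function of x1,
# i.e. lambda_0 + lambda_1*x1 - x2 + v_i = 0 with Var(v_i) small.
n = 50
x1 = rng.normal(size=n)
x2 = 3.0 + 2.0 * x1 + rng.normal(scale=1e-4, size=n)   # small-variance v_i
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(scale=0.1, size=n)

XtX = X.T @ X
print("condition number of X^T X:", np.linalg.cond(XtX))  # very large

# OLS via the normal equations: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Re-estimate after a tiny perturbation of the response.
y_perturbed = y + rng.normal(scale=1e-3, size=n)
beta_hat_perturbed = np.linalg.solve(XtX, X.T @ y_perturbed)

# The individual coefficients on x1 and x2 change substantially,
# but the fitted values (and hence overall predictive accuracy)
# are almost unaffected.
print("original coefficients:  ", beta_hat)
print("perturbed coefficients: ", beta_hat_perturbed)
print("max change in fitted values:",
      np.max(np.abs(X @ (beta_hat - beta_hat_perturbed))))
</syntaxhighlight>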