Facing Multicollinearity in Data Mining
Author: Isabella Morlini (isabella.morlini@unipr.it)
University: University of Parma
Search: http://www.google.com/search?source=ig&hl=en&rlz=&q=Facing+Multicollinearity+in+Data+Mining&aq=f&oq=Facing+Multicollinearity+in+Data+Minin
Original text:
1. Introduction
In regression problems, when the form of the relationship between a dependent variable and multiple predictors is not known a priori, non-parametric models are often applied in order to extract knowledge from data and to adaptively build a regression function.
Points to consider: It is the goal of this paper to understand how nonlinear methods based on the backfitting algorithm are affected by multicollinearity and to show that projection methods manage better with this problem. The structure of the paper is as follows. Section 2 briefly reviews the backfitting algorithms in GAMs and MARS. Section 3 focuses on a numerical example in order to understand further how these methods work and to compare them with other selection and projection tools. Section 4 provides concluding remarks.
Chapter 1 summary: This study suggests that nonparametric analysis techniques are suitable for data with multicollinearity. Parametric analysis techniques are those that assume distributional characteristics among the input variables of the data. (They assume that there is no linear dependence among the variables.) Parametric methods: CART, regression analysis using (SSE), etc.
2. The Backfitting algorithms in GAMs and MARS
In the following we briefly review the backfitting algorithm in GAMs and MARS. We assume that readers are familiar with these models and, due to space limitations, we do not describe them but refer readers to the descriptions in Ripley (1996, chap. IV). The backfitting algorithm (Hastie and Tibshirani, 1990) adaptively builds a set of basis functions by forward selection. This technique works in the original coordinate system and finds linear and nonlinear combinations of these coordinates. In GAMs the forward procedure holds all but one of the additive terms constant, removes that term and fits a smooth term to the residuals. The fitting is applied one variable at a time until the process converges.
(Note: in the GAMs algorithm, the forward procedure holds every additive term but one fixed, removes that term, and fits a smooth term to the remaining residuals. Fitting proceeds one variable at a time until convergence.)
It is this procedure that makes GAMs vulnerable to collinearities between the dependent and the independent variables. If the first variables are correlated with the response, and the smooth term is flexible enough, then the partial residuals result in small values and the algorithm may converge before processing all variables. So the final model depends on the order in which the variables are presented. In a less extreme case, all predictors are selected as basis functions, but the degrees of freedom of each basis function may arbitrarily depend on the order of the variables.
In MARS a tree structure is present and interaction between variables is explicitly allowed. Its forward procedure is somewhat different from that of GAMs. For each predictor and every possible value of that predictor (knot), MARS divides the data into two parts, one on either side of the knot. MARS selects the knot and variable pair which gives the best fit, and to each part it fits the response using a pair of linear functions. If two variables are correlated, at some stage of the tree construction MARS may be forced to choose between placing a knot on one or the other of these predictors.
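The knot search described above can be sketched in a few lines. This is only an illustration on synthetic data, not the paper's experiment and not a full MARS implementation: it performs a single forward step, scores candidates by plain residual sum of squares rather than a penalized criterion, and the helper names `hinge_pair_rss` and `best_knot_per_variable` are hypothetical. The second predictor is generated as a near copy of the first, so both variables offer almost the same best fit.

```python
import numpy as np

def hinge_pair_rss(x, y, knot):
    """RSS of a least-squares fit of y on an intercept plus the pair of
    linear hinge functions max(0, x - knot) and max(0, knot - x)."""
    basis = np.column_stack([
        np.ones_like(x),
        np.maximum(0.0, x - knot),
        np.maximum(0.0, knot - x),
    ])
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    resid = y - basis @ coef
    return float(resid @ resid)

def best_knot_per_variable(X, y):
    """For each predictor, try every interior observed value as a knot
    and keep the (knot, RSS) pair with the smallest RSS."""
    results = {}
    for j in range(X.shape[1]):
        candidates = np.sort(X[:, j])[1:-1]
        rss = [hinge_pair_rss(X[:, j], y, t) for t in candidates]
        k = int(np.argmin(rss))
        results[j] = (float(candidates[k]), rss[k])
    return results

# Synthetic data (an assumption for illustration): x2 is almost a copy of x1.
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(-1, 1, n)
x2 = x1 + rng.normal(0, 0.01, n)
y = np.abs(x1) + rng.normal(0, 0.1, n)
X = np.column_stack([x1, x2])

results = best_knot_per_variable(X, y)
for j, (knot, rss) in results.items():
    print(f"variable {j}: best knot {knot:.3f}, RSS {rss:.3f}")
```

Because the two RSS values come out nearly equal, which variable receives the knot is decided by sampling noise rather than by any real difference in fit.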
If both predictors result in roughly the same penalized residual sum of squares, then the selection may be somewhat arbitrary, and in the final set of basis functions only one of these variables may be represented. In an extreme case, the choice of one variable at the current step may have a great impact on the choice of all further variables and knot selections, and thus on the final model as well. The backward step, which follows the forward phase and aims to produce a model with comparable performance but fewer terms, is also vulnerable to multicollinearity, especially in the additive case (when no interaction is allowed), since over-fitting is avoided by reducing the number of knots rather than via a smoothness penalty.
(Note: in the extreme case, the choice of one variable at the current step strongly affects the choice of variables and split points (knots) in the following steps, and therefore the final model too. The backward step, which follows the forward step and produces a model with fewer terms and comparable performance, is likewise vulnerable to multicollinearity. This holds especially in the additive case [when no interaction is allowed], because over-fitting is avoided by reducing the number of knots rather than through a smoothness penalty.)
* vulnerable: easily wounded, open to attack http://endic.naver.com/endic.nhn?docid=1257640
In conclusion, MARS and GAMs are affected by multicollinearity in that they select the basis functions in a somewhat arbitrary manner, since this choice has no impact on the SSE when a set of variables is highly correlated with each other and with the response. In addition, in many applications a subset selection of the predictors may not be the optimal choice, since a weighted average of the input variables may be preferable to the single one with the highest correlation with the response (for example, in quality control, a weighted average of sensors may be preferable to a single one).
(Note: to conclude, MARS and GAMs are affected by multicollinearity because the basis functions are selected in an arbitrary way: when two variables are highly correlated with each other, whichever one is selected has (almost) no effect on the SSE (sum of squared errors). In addition, in many applications selecting a subset of the predictors may fail to give the optimal model, because a weighted average of the input variables may be preferable to the single variable most highly correlated with the dependent variable. For example, in quality control, a weighted average of sensors may be preferable to a single sensor (?).)
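The order dependence of backfitting described in this section can be illustrated with a small sketch. Everything here is an assumption made for illustration: the data are synthetic, a least-squares cubic polynomial stands in for the smooth term, and the loop stops when the RSS improves by less than 1% in one full cycle. It is not the implementation used in the paper. Two predictors are noisy copies of the same latent variable, and the fit is run with the variables presented in both orders.

```python
import numpy as np

def smooth(x, r, degree=3):
    """Stand-in smoother (an assumption): least-squares cubic fit of r on x."""
    return np.polyval(np.polyfit(x, r, degree), x)

def backfit(X, y, order, tol=0.01, max_iter=20):
    """Backfitting for an additive model y ~ alpha + sum_j f_j(x_j).
    Each pass refits one term at a time to its partial residuals."""
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))
    rss_old = float(np.sum((y - alpha) ** 2))  # RSS of the intercept-only model
    for _ in range(max_iter):
        for j in order:
            # Partial residuals: remove every term except the j-th.
            partial = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = smooth(X[:, j], partial)
            f[:, j] -= f[:, j].mean()  # keep each term centered
        rss = float(np.sum((y - alpha - f.sum(axis=1)) ** 2))
        if rss_old - rss < tol * rss_old:  # RSS barely improved: stop
            break
        rss_old = rss
    return f, rss

# Synthetic data (an assumption): both predictors measure the same latent z.
rng = np.random.default_rng(0)
n = 300
z = rng.uniform(-1, 1, n)
X = np.column_stack([z + rng.normal(0, 0.02, n), z + rng.normal(0, 0.02, n)])
y = np.sin(2 * z) + rng.normal(0, 0.1, n)

f_a, rss_a = backfit(X, y, order=(0, 1))
f_b, rss_b = backfit(X, y, order=(1, 0))
var_a, var_b = f_a.var(axis=0), f_b.var(axis=0)
print("order (0, 1): term variances", var_a, "RSS", rss_a)
print("order (1, 0): term variances", var_b, "RSS", rss_b)
```

In both runs the variable presented first absorbs almost all of the signal and the other term stays close to zero, while the two RSS values are nearly identical. The SSE therefore gives no reason to prefer one ordering over the other.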
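Chapter 2 summary: This chapter explained that the decision-tree algorithms MARS and GAMs are affected by multicollinearity. It describes fairly well why this is a problem, and it is useful as reference material. The part at the very end, about a weighted sum being preferred over a single variable, is not easy to understand.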
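One way to read the weighted-average remark is through a small simulation. The setup below is an assumption made for illustration and does not come from the paper: three sensors measure the same latent quantity with different noise levels, and the response depends on that latent quantity. The sketch compares a regression on the single sensor most correlated with the response against a regression on all three sensors, which amounts to a weighted combination of them.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    """Three noisy sensors of one latent quantity z, and a response driven by z."""
    z = rng.normal(0.0, 1.0, n)
    noise_sd = np.array([0.3, 0.4, 0.5])  # assumed sensor noise levels
    sensors = z[:, None] + rng.normal(0.0, 1.0, (n, 3)) * noise_sd
    y = 2.0 * z + rng.normal(0.0, 0.1, n)
    return sensors, y

def fit_linear(X, y):
    """Ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

S_train, y_train = simulate(1000)
S_test, y_test = simulate(1000)

# Subset selection: keep only the sensor most correlated with the response.
corr = [abs(np.corrcoef(S_train[:, k], y_train)[0, 1]) for k in range(3)]
best = int(np.argmax(corr))
coef_single = fit_linear(S_train[:, [best]], y_train)
mse_single = float(np.mean((y_test - predict(coef_single, S_test[:, [best]])) ** 2))

# Weighted combination: regress on all sensors at once.
coef_all = fit_linear(S_train, y_train)
mse_all = float(np.mean((y_test - predict(coef_all, S_test)) ** 2))
weights = coef_all[1:] / coef_all[1:].sum()

print("best single sensor:", best, "test MSE:", mse_single)
print("all sensors, normalized weights:", weights, "test MSE:", mse_all)
```

Each sensor carries independent measurement noise, so combining them averages part of that noise away. The combined model reaches a lower test error than the best single sensor, even though the sensors are highly correlated with one another.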
3. Numerical Example
4. Conclusion
Non-linear selection models based on the backfitting algorithm are often preferred to non-parametric projection methods, since they build simpler and more understandable models. However, they are affected by multicollinearity in that they select the knot placement in a somewhat arbitrary manner when this choice has no impact on the SSE. Hence, they may not be the optimal alternative for model building in the presence of multicollinearity. Nonparametric methods like PPR and MLP are shown to find the correct dimension of the projection space relevant for prediction. RBFNs are found to give rise to numerical problems when using the Gaussian transformation. With a non-localized function, they are shown to identify a projection space not far from the dimension of the relevant subspace.
5. Main References
[1] Hastie T.J., Tibshirani R.J. (1990) Generalized Additive Models, Chapman and Hall, London.