Paper description
The paper below is a reference for [A Study on Decision Trees Robust to Multicollinearity]. There is not much prior work on the effects of multicollinearity on decision trees and how to address them, so this paper was a useful reference. It experimentally examines the effect of four data characteristics, including multicollinearity, on decision trees, and shows that the remedies for multicollinearity used in regression analysis do not carry over to decision trees (removing a highly correlated variable did not make the resulting model better).
Paper information
- Original file:
- Search method: http://www.sciencedirect.com/science/journal/09574174 — searched with the query [Input Data for Decision Trees]
- Journal: Expert Systems with Applications
- Author: Selwyn Piramuthu
- Analysis notes (2009/2/8)
Title
Input Data for Decision Trees
0. Abstract
Data Mining has been successful in a wide variety of application areas for varied purposes. Data Mining itself is done using several different methods. Decision trees are one of the popular methods that have been used for Data Mining purposes. Since the process of constructing these decision trees assumes no distributional patterns in the data (non-parametric), the characteristics of the input data are usually not given much attention. We consider some characteristics of input data and their effect on the learning performance of decision trees. Preliminary results indicate that the performance of decision trees can be improved with minor modifications of the input data.
1. Introduction
Data Mining has been successful in a wide variety of application areas, including marketing, for varied purposes (Adomavicius & Tuzhilin, 2001; Kushmerick, 1999; van der Putten, 1999; Shaw, Subramaniam, Tan, & Welge, 2001; Thearling, 1999). Data Mining itself is done using several different methods, depending on the type of data as well as the purpose of Data Mining (Ansari, Kohavi, Mason, & Zheng, 2000; Cooley & Tan, 2000). For example, if the purpose is classification using real data, feed-forward neural networks might be appropriate (Ragavan & Piramuthu, 1991). Decision trees might be appropriate if the purpose is classification using nominal data (Quinlan, 1993). Further, if the purpose is to identify associations in data, association rules might be appropriate (Brijs, Swinnen, Vanhoof, & Wets, 1999). Decision trees are one of the popular methods that have been used for Data Mining purposes. Decision trees can be constructed using a variety of methods. For example, C4.5 (Quinlan, 1993) uses information-theoretic measures and CART (Breiman, Friedman, Olshen, & Stone, 1984) uses statistical methods. The usefulness as well as the classification and computational performance of Data Mining frameworks incorporating decision trees can be improved by (1) appropriate preprocessing of input data, (2) fine-tuning the decision tree algorithm itself, and (3) better interpretation of output. There have been several studies that have addressed each of these scenarios. Input data can be preprocessed (1) to reduce the complexity of the data for ease of learning, and (2) to reduce effects due to unwanted characteristics of the data. The former includes such techniques as feature selection and feature construction as well as other data modifications (see, for example, Brijs & Vanhoof, 1998; Kohavi, 1995; Ragavan & Piramuthu, 1991). The latter includes removal of noisy, redundant, and irrelevant data used as input to decision tree learning.
We consider some characteristics of input data and their effects on the learning performance of decision trees. Specifically, we consider the effects of non-linearity, outliers, heteroscedasticity, and multicollinearity in data. These have been shown to have significant effects on regression analysis. However, there has not been any published study that deals with these characteristics and their effects on the learning performance of decision trees. Using a few small data sets that are available over the Internet, we consider each of these characteristics and compare their effects on regression analysis as well as decision trees. The results from regression analysis are from the Internet. The contribution of this paper is in studying the effects of these characteristics on decision trees, specifically See-5 (2001). Preliminary results suggest that the performance of decision trees can be improved with minor modifications of the input data. The rest of the paper is organized as follows: evaluation of some input data characteristics and their effects on the learning performance of decision trees is provided in the next section. Experimental results are also included in Section 2. Section 3 concludes the paper with a brief discussion of the results from this study and their implications as well as future extensions to this study.
Section 1 summary: Section 1 of this paper gives an overview of data mining and decision trees. It notes that the following four data characteristics affect not only regression analysis but also decision trees: (1) non-linearity, (2) outliers, (3) heteroscedasticity, (4) multicollinearity.
2. Evaluation of input data characteristics for decision trees
Traditional statistical regression analysis assumes a certain distribution (e.g., Gaussian) of the input data, as well as other characteristics of the data, such as the data being independent and identically distributed. In most real-world data, some of these assumptions are often violated. And there are several means to at least partially rectify some of the consequences that arise from these violations. We consider a few of these situations: non-linearity in the data, the presence of outliers in the data, the presence of heteroscedasticity in the data, and the presence of multicollinearity in the data. The data sets used in this study are known to have these characteristics. The following subsections address each of these scenarios in turn. We use See-5 as the decision tree generator throughout this study.
2.1 Non-linearity in input data
Non-linearity is a problem in linear regression simply because it is hard to fit a linear model to non-linear data. Therefore, non-linear transformations are made to the data before running regressions on these data. We consider the effects of non-linear data on decision trees both before and after the appropriate data transformations are made. This data set contains four variables: the independent variables x1, x2, and x3 and the dependent variable y. Results using ordinary least squares (OLS) regression to predict y using x1, x2, and x3 are provided below. The presence of higher-order trend effects in the data is identified using the omitted variable test (ovtest with the rhs option in the statistical analysis software Stata). On inspection of scatter plots of the data, the presence of non-linear trend patterns in the variable x2 is confirmed. We substitute x2 with its centered value (x2cent, i.e., subtract its mean from every value) and the square of the centered value (x2centsq). The results from this regression are provided below. Now, let us consider the same two sets of data, both before and after incorporating the squared term, and evaluate the effect on the performance of the decision tree learned. We use 10-fold cross-validation in See-5 to reduce any bias due to sample selection. Both the mean values and the standard deviation values (in parentheses) are provided for the resulting decision trees.
Here (Table 1), the addition of the two transformed x2 variables has resulted in a small reduction in the size of the decision trees and a significant decrease in the prediction error. The prediction error is the classification error on examples unseen during the generation of the decision trees.
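The transformation described in Section 2.1 (centering x2 and adding the square of the centered value) can be sketched as follows. This is an illustrative reconstruction with NumPy, since the paper's actual data set and See-5 runs are not reproduced here; the toy values for x2 are hypothetical, while the names x2cent and x2centsq follow the paper.

```python
import numpy as np

def add_centered_square(x2):
    """Center a predictor and add the square of the centered value,
    as done for x2 in the paper (producing x2cent and x2centsq)."""
    x2 = np.asarray(x2, dtype=float)
    x2cent = x2 - x2.mean()     # subtract the mean from every value
    x2centsq = x2cent ** 2      # square of the centered value
    return x2cent, x2centsq

# Toy column standing in for the paper's x2 (hypothetical values).
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2cent, x2centsq = add_centered_square(x2)
print(x2cent)     # centered values: -2, -1, 0, 1, 2
print(x2centsq)   # squared centered values: 4, 1, 0, 1, 4
```

Both derived columns would then replace x2 as inputs to the regression or the tree learner; centering before squaring keeps the linear and quadratic terms less correlated than using raw x2 and x2^2.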
2.2 Presence of outliers in input data
2.3 Heteroscedasticity in input data
2.4 Multicollinearity in input data
Data where the independent variables are highly correlated is said to have multicollinearity. In regression analysis, multicollinearity is a problem when we are interested in the exact values of the coefficients of the independent variables. When multicollinearity is present, this is not possible. Multicollinearity is identified by (1) the presence of high pairwise correlations among the independent variables, (2) a high R^2 value with low t-statistics, and (3) coefficients that change when variables are added to and dropped from the model. Multicollinearity is not a problem when the only purpose of regression analysis is forecasting. However, if the analysis is to determine and evaluate the coefficients, multicollinearity is a problem. One of the ways to alleviate this problem is to drop the variable with the highest pairwise correlation values among the independent variables. We consider the effects of multicollinearity on decision trees both before and after the problem has been alleviated in the input data. Here, the R^2 value is significant while the t-statistics are not. We then consider the pairwise correlations among the independent variables. The resulting matrix is given below. Clearly, x4 is highly correlated with the rest of the independent variables. This variable is then removed from the data. The result from the OLS regression run without x4 is given below. Here, all the variables turn out to be significant. Now, let us consider the same two sets of data, both before and after removing x4 from the data, and evaluate the effect on the performance of the decision tree learned. Again, we use 10-fold cross-validation in See-5 to reduce any bias due to sample selection.
Here (Table 4), the removal of x4 has resulted in a significant change in the size of the decision trees and a significant increase in the prediction error. Here, removal of the variable does not seem to help the performance of decision trees.
2.5 Data reduction
3. Discussion
Even though decision trees constructed using information-theoretic measures are considered non-parametric, the distribution of the data does influence the classification performance of these decision trees. Preliminary results indicate that the performance of decision trees can be improved by considering the effects due to non-linearity, outliers, heteroscedasticity, and multicollinearity in the input data, as well as data reduction. Both non-linearity and the presence of outliers did affect the classification performance of decision trees. The presence of heteroscedasticity did not affect the classification performance of decision trees significantly. The attempt to remove multicollinearity resulted in poor classification performance. Data reduction resulted in improved performance both in terms of the resulting tree size and classification.

We are currently in the process of evaluating the results we have thus far using larger and more data sets. We are also in the process of studying why these data characteristics affect the classification performance of decision trees. Also, in this study, we were only interested in the size of the decision trees and their classification accuracy. The computational cost of this process is also important, and that is left as an exercise for a future study. In addition to the characteristics presented in this paper, we are also evaluating other data characteristics, including non-independence and non-normality of data. We presented one possible means to improve the classification performance of decision trees. This, along with other pre-processing methods (such as feature selection and feature construction), methods for fine-tuning decision trees, and those that enhance the interpretability of results, would help improve the overall performance of decision support tools incorporating decision trees.
4. References
[1] Adomavicius, G., & Tuzhilin, A. (2001). Using data mining methods to build customer profiles.
[2] Anderson, E. (1935). The Irises of the Gaspe Peninsula. Bulletin of the American Iris Society.
[3] Ansari, S., Kohavi, R., Mason, L., & Zheng, Z. (2000). Integrating E-commerce and data mining.
Facing Multicollinearity in Data Mining
Author: Isabella Morlini (isabella.morlini@unipr.it)
University: University of Parma
Search: http://www.google.com/search?source=ig&hl=en&rlz=&q=Facing+Multicollinearity+in+Data+Mining&aq=f&oq=Facing+Multicollinearity+in+Data+Minin
Original file:
1. Introduction
In regression problems, when the form of the relationship between a dependent variable and multiple predictors is not known a priori, non-parametric models are often applied in order to extract knowledge from data and to adaptively build a regression function. It is the goal of this paper to understand how nonlinear methods based on the backfitting algorithm are affected by multicollinearity and to show that projection methods manage better with this problem. The structure of the paper is as follows. Section 2 briefly reviews the backfitting algorithms in GAMs and MARS. Section 3 focuses on a numerical example in order to understand further how these methods work and to compare them with other selection and projection tools. Section 4 provides concluding remarks.
Section 1 summary: This paper suggests that non-parametric methods are better suited to data with multicollinearity. A parametric analysis method is one that assumes distributional characteristics among the input variables (here, the assumption that there is no linearity among the variables). Parametric methods: CART, regression analysis using SSE, etc.
2. The Backfitting algorithms in GAMs and MARS
In the following we briefly review the backfitting algorithm in GAMs and MARS. We assume that the readers are familiar with these models and, due to space limitations, we do not describe them but refer the readers to the descriptions in Ripley (1996, chap. IV). The backfitting algorithm (Hastie and Tibshirani, 1990) adaptively builds a set of basis functions by forward selection. This technique works in the original coordinate system and finds linear and nonlinear combinations of these coordinates. In GAMs the forward procedure holds all but one of the additive terms constant, removes that term, and fits a smooth term to the residuals. The fitting is applied one variable at a time until the process converges. It is this procedure that makes GAMs vulnerable to collinearities between the dependent and the independent variables. If the first variables are correlated with the response, and the smooth term is flexible enough, then the partial residuals result in small values and the algorithm may converge before processing all variables. So the final model depends on the order in which the variables are presented. In a less extreme case, all predictors are selected as basis functions, but the degrees of freedom of each basis function may arbitrarily depend on the order of the variables. In MARS a tree structure is present and interaction between variables is explicitly allowed. The forward procedure is somewhat different from GAMs. For each predictor and every possible value of that predictor (knot), MARS divides the data into two parts, one on either side of the knot. MARS selects the knot and variable pair which give the best fit, and to each part it fits the response using a pair of linear functions. If two variables are correlated, at some stage of the tree construction MARS may be forced to choose between placing a knot on one of these predictors.
If both predictors result in roughly the same penalized residual sum of squares, then the selection may be somewhat arbitrary, and in the final set of basis functions only one of these variables may be represented. In an extreme case it may happen that the choice of one variable at the current step has a great impact on the choice of all further variables and knot selections, and thus on the final model as well. The backward step, which follows the forward phase and aims to produce a model with comparable performance but fewer terms, is also vulnerable to multicollinearity, especially in the additive case (when no interaction is allowed), since over-fitting is avoided by reducing the number of knots rather than via a smoothness penalty. In conclusion, MARS and GAMs are affected by multicollinearity in that they select the basis functions in a somewhat arbitrary manner, since this choice has no impact on the SSE when a set of variables is highly correlated with each other and with the response. In addition, in many applications a subset selection of the predictors may not be the optimal choice, since a weighted average of the input variables may be preferable to the single one with the highest correlation with the response (for example, in quality control, a weighted average of sensors may be preferable to a single one).
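The backfitting loop described above (hold all but one additive term fixed, fit a smoother to the partial residuals, and cycle until convergence) can be sketched with a simple least-squares projection standing in for each smoother. This is only an illustrative skeleton under that assumption, not the GAM implementation the paper evaluates; real GAMs use flexible spline smoothers in place of the per-variable linear fit here.

```python
import numpy as np

def backfit_linear(X, y, n_iter=50):
    """Minimal backfitting loop (after Hastie & Tibshirani, 1990), with a
    univariate least-squares fit playing the role of each smoother."""
    n, p = X.shape
    alpha = y.mean()                 # intercept term
    f = np.zeros((n, p))             # additive components f_j(x_j)
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals: remove every additive term except the j-th
            r = y - alpha - f.sum(axis=1) + f[:, j]
            xj = X[:, j]
            beta[j] = xj @ r / (xj @ xj)   # "smooth" = simple projection
            f[:, j] = beta[j] * xj
    return alpha, beta

# Synthetic data with nearly uncorrelated predictors: backfitting
# recovers coefficients close to the true values 2 and -1.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)
alpha, beta = backfit_linear(X, y)
```

With correlated columns, the cycling order and the flexibility of each smoother start to matter, which is exactly the sensitivity to multicollinearity the section describes.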
Section 2 summary: This section explains that the MARS and GAM algorithms are affected by multicollinearity. It describes why this is a problem fairly well, and it is a helpful reference. The point at the end, that a weighted average of variables may be preferred over a single one, is not entirely clear to me.
3. Numerical Example
4. Conclusion
Non-linear selection models based on the backfitting algorithm are often liked better than non-parametric projection methods, since they build simpler and more understandable models. However, they are affected by multicollinearity in that they select the knot placements in a somewhat arbitrary manner, when this choice has no impact on the SSE. Hence, they may not be the optimal alternative in model building in the presence of multicollinearity. Nonparametric methods like PPR and MLP are shown to find the correct dimension of the projection space relevant for prediction. RBFNs are found to give rise to numerical problems using the Gaussian transformation. With a non-localized function, they are shown to identify a projection space not far from the dimension of the relevant subspace.
5. Main References
[1] Hastie, T.J., & Tibshirani, R.J. (1990). Generalized Additive Models. Chapman and Hall, London.