글
참고주소 : http://www.rulequest.com/see5-comparison.html
0. C4.5와 C5.0은 어떻게 다른가?
C4.5 is a widely-used free data mining tool that is descended from an earlier system called ID3 and is followed in turn by See5/C5.0. To demonstrate the advances in this new generation, we will compare See5/C5.0 Release 2.06 with C4.5 Release 8 using three sizable datasets:
|
1. Rulesets: often more accurate, much faster, and much less memory
These graphs show the accuracy on unseen test cases, number of rules produced, and construction time for the three datasets. Results for C5.0 are shown in blue.
Both C4.5 and C5.0 can produce classifiers expressed either as decision trees or rulesets. In many applications, rulesets are preferred because they are simpler and easier to understand than decision trees, but C4.5's ruleset methods are slow and memory-hungry. C5.0 embodies new algorithms for generating rulesets, and the improvement is dramatic. Accuracy: The C5.0 rulesets have noticeably lower error rates on unseen cases for the sleep and forest datasets. The C4.5 and C5.0 rulesets have the same predictive accuracy for the income dataset, but the C5.0 ruleset is much smaller. |
2. 기능적인 차이 (C5.0 에서 추가된 기능)
New functionality The cases themselves may also be of unequal importance. In an application that classifies individuals as likely or not likely to "churn," for example, the importance of each case may vary with the size of the account. C5.0 has provision for a case weight attribute that quantifies the importance of each case; if this appears, C5.0 attempts to minimize the weighted predictive error rate. C5.0 has several new data types in addition to those available in C4.5, including dates, times, timestamps, ordered discrete attributes, and case labels. In addition to missing values, C5.0 allows values to be noted as not applicable. Further, C5.0 provides facilities for defining new attributes as functions of other attributes. Some recent data mining applications are characterized by very high dimensionality, with hundreds or even thousands of attributes. C5.0 can automatically winnow the attributes before a classifier is constructed, discarding those that appear to be only marginally relevant. For high-dimensional applications, winnowing can lead to smaller classifiers and higher predictive accuracy, and can even reduce the time required to generate rulesets. C5.0 is also easier to use. Options have been simplified and extended -- to support sampling and cross-validation, for instance -- and C4.5's programs for generating decision trees and rulesets have been merged into a single program. The Windows version, See5, has a user-friendly graphic interface and adds a couple of interesting facilities. For example, the cross-reference window makes classifiers more understandable by linking cases to relevant parts of the classifier. RuleQuest provides free source code for reading and interpreting See5/C5.0 classifiers. After the classifiers have been generated by See5/C5.0, this code allows you to access them from other programs and to deploy them in your own applications. For more information on See5/C5.0, please see the tutorial. |
3. Decision trees: faster, smaller
For all three datasets, C4.5 and C5.0 produce trees with similar predictive accuracies (although C5.0's are marginally better for the sleep and income applications). The major differences are the tree sizes and computation times; C5.0's trees are noticeably smaller and C5.0 is faster by factors of 6.5, 4.6, and 21 respectively. |
4. Boosting
Based on the research of Freund and Schapire, this is an exciting new development that has no counterpart in C4.5. Boosting is a technique for generating and combining multiple classifiers to improve predictive accuracy. The graphs above show what happens in 10-trial boosting where ten separate decision trees or rulesets are combined to make predictions. The error rate on unseen cases is reduced for all three datasets, substantially so in the case of forest for which the error rate of boosted classifiers is more than 40% below that of the corresponding C4.5 classifier. Unfortunately, boosting doesn't always help -- when the training cases are noisy, boosting can actually reduce classification accuracy. C5.0 uses a proprietary variant of boosting that is less affected by noise, thereby partly overcoming this limitation. C5.0 supports boosting with any number of trials. Naturally, it takes longer to produce boosted classifiers, but the results can justify the additional computation! Boosting should always be tried when peak predictive accuracy is required, especially when unboosted classifiers are already quite accurate. |
'스터디 자료' 카테고리의 다른 글
연관규칙, 순차패턴 기법 동영상 강의 (0) | 2009.08.04 |
---|---|
[강좌] 회귀분석 (Regression Analysis) 정리 (0) | 2009.08.04 |
연관규칙의 탐사 (Inside I.E 포항공대 전 치 혁 교수) (0) | 2009.08.04 |
[강좌] 다중공선성 (Multicollinearity) 정리 (8) | 2009.08.04 |
제 7회 SAS마이닝챔피언십 (2009년) (0) | 2009.08.02 |
댓글