Reference: http://www.rulequest.com/see5-comparison.html

0. How do C4.5 and C5.0 differ?

C4.5 is a widely-used free data mining tool that is descended from an earlier system called ID3 and is followed in turn by See5/C5.0. To demonstrate the advances in this new generation, we will compare See5/C5.0 Release 2.06 with C4.5 Release 8 using three sizable datasets:


Sleep stage scoring data (sleep, 105,908 cases). Every case in this monitoring application is described by six numeric-valued attributes and belongs to one of six classes. C5.0 and C4.5 use 52,954 cases to construct classifiers that are tested on the remaining 52,954 cases. The data are publicly available.
Census income data (income, 199,523 cases). The goal of this application is to predict whether a person's income is above or below $50,000 using seven numeric and 33 discrete (nominal) attributes. The data are divided into a training set of 99,762 cases and a test set of 99,761. They can be obtained from the UCI KDD Archive.
Forest cover type data (forest, 581,012 cases); also from UCI. This application has seven classes (possible types of forest cover), and the cases are described in terms of 12 numeric and two multi-valued discrete attributes. As before, half of the data -- 290,506 cases -- are used for training and the remainder for testing the learned classifiers.
Since C4.5 is a Unix-based system, results for the Unix version of C5.0 are presented to facilitate comparison. Both were compiled using the Intel C compiler 10.1 with the same optimization settings. Times are elapsed seconds on an unloaded Intel Q9550 (2.8GHz) with 2GB of RAM running 64-bit Fedora Core.
So, let's see how C5.0 stacks up against C4.5.

1. Rulesets: often more accurate, much faster, and much less memory

[Graphs in the original article show, for each of the three datasets, the accuracy on unseen test cases, the number of rules produced, and the ruleset construction time; results for C5.0 are shown in blue.]

Both C4.5 and C5.0 can produce classifiers expressed either as decision trees or rulesets. In many applications, rulesets are preferred because they are simpler and easier to understand than decision trees, but C4.5's ruleset methods are slow and memory-hungry. C5.0 embodies new algorithms for generating rulesets, and the improvement is dramatic.

Accuracy: The C5.0 rulesets have noticeably lower error rates on unseen cases for the sleep and forest datasets. The C4.5 and C5.0 rulesets have the same predictive accuracy for the income dataset, but the C5.0 ruleset is much smaller.
Speed: The difference in speed is dramatic. For instance, C4.5 required more than 84 minutes to find the ruleset for income, but C5.0 completed the task in just over 4 seconds.
Memory: C5.0 commonly uses an order of magnitude less memory than C4.5 during ruleset construction. For the forest dataset, C4.5 needs more than 3GB (the job would not complete on earlier 32-bit systems), but C5.0 requires less than 200MB.

2. Functional differences (features added in C5.0)

New functionality
C5.0 incorporates several new facilities such as variable misclassification costs. In C4.5, all errors are treated as equal, but in practical applications some classification errors are more serious than others. C5.0 allows a separate cost to be defined for each predicted/actual class pair; if this option is used, C5.0 then constructs classifiers to minimize expected misclassification costs rather than error rates.
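
To make the idea concrete, here is a minimal sketch of cost-sensitive prediction in Python. It is not C5.0's internal code; the two-class cost matrix and probabilities are invented for illustration.

    import numpy as np

    # cost[i][j] = cost of predicting class j when the true class is i.
    # This asymmetric matrix is invented: missing a positive case (row 1)
    # is five times as costly as a false alarm.
    cost = np.array([[0.0, 1.0],
                     [5.0, 0.0]])

    def predict_min_cost(class_probs, cost):
        # Expected cost of predicting j is sum_i P(i) * cost[i, j];
        # choose the prediction that minimizes it.
        expected = class_probs @ cost
        return int(np.argmin(expected))

    # P(class 0) = 0.8, yet the cheaper prediction is class 1:
    print(predict_min_cost(np.array([0.8, 0.2]), cost))   # -> 1

With equal costs this reduces to picking the most probable class, which is exactly the error-rate criterion that C4.5 uses.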

The cases themselves may also be of unequal importance. In an application that classifies individuals as likely or not likely to "churn," for example, the importance of each case may vary with the size of the account. C5.0 has provision for a case weight attribute that quantifies the importance of each case; if this appears, C5.0 attempts to minimize the weighted predictive error rate.
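
As a sketch of the quantity being minimized (again illustrative Python with invented data), a weighted error rate simply counts each mistake in proportion to its case's importance:

    import numpy as np

    def weighted_error_rate(y_true, y_pred, case_weights):
        # Each misclassified case contributes its weight (e.g. account
        # size) rather than a flat count of one.
        w = np.asarray(case_weights, dtype=float)
        mistakes = np.asarray(y_true) != np.asarray(y_pred)
        return (w * mistakes).sum() / w.sum()

    # The single error falls on the weight-10 account and dominates:
    print(weighted_error_rate([0, 1, 1, 0], [0, 1, 1, 1], [1, 1, 1, 10]))
    # -> 10/13 ~ 0.77, versus an unweighted error rate of 0.25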

C5.0 has several new data types in addition to those available in C4.5, including dates, times, timestamps, ordered discrete attributes, and case labels. In addition to missing values, C5.0 allows values to be noted as not applicable. Further, C5.0 provides facilities for defining new attributes as functions of other attributes.
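
For illustration, the sketch below shows how these types might be declared in a C5.0 names file. The syntax ([ordered] value lists, := definitions, | comments) follows the C5.0 documentation, but every attribute name and value here is invented; treat it as an assumption-laden sketch, not an excerpt from the article.

    churned.                         | the attribute to be predicted

    signup date:   date.             | new data types in C5.0
    signup time:   time.
    last contact:  timestamp.
    plan:          [ordered] basic, standard, premium.
    monthly fee:   continuous.
    customer id:   label.            | identifies cases in output
    fee per week := monthly fee / 4. | attribute defined from another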

Some recent data mining applications are characterized by very high dimensionality, with hundreds or even thousands of attributes. C5.0 can automatically winnow the attributes before a classifier is constructed, discarding those that appear to be only marginally relevant. For high-dimensional applications, winnowing can lead to smaller classifiers and higher predictive accuracy, and can even reduce the time required to generate rulesets.
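
C5.0's winnowing algorithm itself is unpublished, so the Python sketch below swaps in a generic relevance filter (scikit-learn's mutual-information score, an assumption on my part) purely to show the shape of the idea: score every attribute, then drop the marginal ones before training.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def winnow(X, y, threshold=0.01):
        # Score each attribute's relevance to the class; this stand-in
        # uses mutual information, not C5.0's own criterion.
        scores = mutual_info_classif(X, y)
        keep = np.flatnonzero(scores > threshold)
        # Train on the surviving columns only.
        return X[:, keep], keep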

C5.0 is also easier to use. Options have been simplified and extended -- to support sampling and cross-validation, for instance -- and C4.5's programs for generating decision trees and rulesets have been merged into a single program.
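
As an example of the merged command-line program, hypothetical invocations like these cover sampling and cross-validation in one tool; the -f, -r, -S, and -X flags follow the C5.0 documentation, but the exact commands are an assumption, not something shown in the article.

    c5.0 -f income -r -S 10    # rulesets from a 10% training sample
    c5.0 -f income -r -X 10    # 10-fold cross-validation with rulesets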

The Windows version, See5, has a user-friendly graphic interface and adds a couple of interesting facilities. For example, the cross-reference window makes classifiers more understandable by linking cases to relevant parts of the classifier.

RuleQuest provides free source code for reading and interpreting See5/C5.0 classifiers. After the classifiers have been generated by See5/C5.0, this code allows you to access them from other programs and to deploy them in your own applications.

For more information on See5/C5.0, please see the tutorial.


3. Decision trees: faster, smaller

For all three datasets, C4.5 and C5.0 produce trees with similar predictive accuracies (although C5.0's are marginally better for the sleep and income applications). The major differences are the tree sizes and computation times; C5.0's trees are noticeably smaller and C5.0 is faster by factors of 6.5, 4.6, and 21 respectively.

4. Boosting

Boosting, based on the research of Freund and Schapire, is an exciting new development that has no counterpart in C4.5: a technique for generating and combining multiple classifiers to improve predictive accuracy.

In 10-trial boosting, ten separate decision trees or rulesets are combined to make predictions. The error rate on unseen cases is reduced for all three datasets, substantially so in the case of forest, for which the error rate of the boosted classifier is more than 40% below that of the corresponding C4.5 classifier. Unfortunately, boosting doesn't always help: when the training cases are noisy, boosting can actually reduce classification accuracy. C5.0 uses a proprietary variant of boosting that is less affected by noise, thereby partly overcoming this limitation.

C5.0 supports boosting with any number of trials. Naturally, it takes longer to produce boosted classifiers, but the results can justify the additional computation! Boosting should always be tried when peak predictive accuracy is required, especially when unboosted classifiers are already quite accurate.
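
C5.0's boosting variant is proprietary, but the mechanism it builds on is the Freund-Schapire AdaBoost family. Below is a minimal binary AdaBoost sketch in Python (decision stumps from scikit-learn as the weak learner, labels coded -1/+1; a generic illustration, not C5.0's algorithm).

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def boost(X, y, trials=10):
        # y must be coded -1/+1. Start with uniform case weights.
        y = np.asarray(y)
        n = len(y)
        w = np.full(n, 1.0 / n)
        learners, alphas = [], []
        for _ in range(trials):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = w[pred != y].sum()
            if err == 0 or err >= 0.5:      # perfect, or no better than chance
                break
            alpha = 0.5 * np.log((1 - err) / err)
            w *= np.exp(-alpha * y * pred)  # up-weight mistakes, down-weight hits
            w /= w.sum()
            learners.append(stump)
            alphas.append(alpha)
        return learners, alphas

    def predict(learners, alphas, X):
        # Final prediction is a confidence-weighted vote over all trials.
        votes = sum(a * l.predict(X) for l, a in zip(learners, alphas))
        return np.sign(votes)

The weight update also shows why noise hurts plain boosting: mislabeled cases keep gaining weight, so later trials chase them; this is the limitation that C5.0's variant is said to mitigate.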


by 에이아이 2009. 8. 4. 19:19