Using Decision Tree to Solve Classification Problems

The Decision Tree task can solve classification problems by building a tree structure of intelligible rules.


Prerequisites

Additional tabs

Along with the Options tab, where the task can be configured, the following additional tabs are provided:

  • Documentation tab where you can document your task,

  • Parametric options tab, where you can configure process variables instead of fixed values. Parametric equivalents (PO) are shown in parentheses after each option name on this page.

  • Monitor and Results tabs, where you can see the output of the task computation. See the Results section below.


Procedure

  1. Drag and drop the Decision Tree task onto the stage.

  2. Connect a task containing the attributes from which you want to create the model to the new task.

  3. Double click the Decision Tree task.

  4. Configure the options described in the table below.

  5. Save and compute the task.

Decision Tree options

Each option is listed below with its parametric option (PO) name in parentheses, followed by a description.

Input attributes (PO: inpnames)

Drag and drop the input attributes which will be used to classify data in the decision tree.

Output attributes (PO: outnames)

Drag and drop the attributes which will be used to form the final classes into which the dataset will be divided.

Minimum number of patterns in a leaf (PO: npattnode)

The minimum number of patterns that a leaf can contain. If a node contains fewer patterns than this threshold, tree growth is stopped and the node is considered a leaf.
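As a rough analogue outside Rulex, scikit-learn's DecisionTreeClassifier exposes a similar (though not identical) threshold through its min_samples_leaf parameter. The sketch below uses synthetic data and is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Analogue of npattnode: no leaf may contain fewer than 30 patterns,
# so growth stops before nodes become that small.
clf = DecisionTreeClassifier(min_samples_leaf=30).fit(X, y)
print(clf.get_n_leaves())
```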

Impurity measure (PO: impurtype)

The method used to measure the impurity of a leaf. Considering a classification problem with $k$ classes and a given node η, the following choices are currently available (a computational sketch follows the list):

  • Entropy, given by $-\sum_{j=1}^{k} P_\eta(y_j)\,\log_2 P_\eta(y_j)$, where $P_\eta(y_j)$ is the frequency of the j-th class in the node η.

  • Gini, given by $1 - \sum_{j=1}^{k} P_\eta(y_j)^2$.

  • Error, given by $1 - \max_{j} P_\eta(y_j)$.
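For concreteness, all three measures can be computed directly from the class frequencies of a node. This minimal Python sketch is illustrative and not part of Rulex; the node composition reuses the 78/39 split from the example later on this page:

```python
import numpy as np

def class_frequencies(labels):
    """P_eta(y_j): relative frequency of each class in a node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):
    p = class_frequencies(labels)
    return -np.sum(p * np.log2(p))

def gini(labels):
    p = class_frequencies(labels)
    return 1.0 - np.sum(p ** 2)

def error(labels):
    return 1.0 - class_frequencies(labels).max()

node = ["<50K"] * 78 + [">50K"] * 39
print(entropy(node), gini(node), error(node))
```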

Pruning method (PO: treepruning)

The method used to prune redundant leaves after tree creation. The following choices are currently available:

  • No pruning: leaves are not pruned and the tree is left unchanged.

  • Cost-complexity: according to this approach, implemented in CART, the tree is pruned through a cost-complexity measure that creates a sequence of sub-trees and selects the best one by evaluating them on a validation set. Each sub-tree is created from the previous one by minimizing a cost-complexity measure that takes into account both the misclassification level on the training set and the number of leaves (see the code sketch after this list).

  • Reduced error: this simple method, introduced by Quinlan, employs the validation set to decide whether a sub-tree should be replaced by a single leaf. If the error in the validation set after transforming an internal node into a leaf decreases, the relative sub-tree is removed.

  • Pessimistic (default choice): the tree is pruned according to the pessimistic pruning approach introduced by Quinlan. With this method it is not necessary to create a validation set, since the training set is employed both for tree creation and for tree pruning. Pessimistic pruning applies a correction to the error rate (pessimistic error) at each node to decide whether it should be pruned.
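As a hedged illustration of the cost-complexity idea only, scikit-learn implements CART-style minimal cost-complexity pruning. The sketch below builds the sequence of sub-trees (indexed by the complexity parameter alpha) and picks the one that performs best on a validation set, mirroring the description above; data and parameters are synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Sequence of sub-trees, indexed by the cost-complexity parameter alpha
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Pick the alpha whose pruned tree scores best on the validation set
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
    .fit(X_tr, y_tr)
    .score(X_val, y_val),
)
print("best alpha:", best_alpha)
```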

Maximum impurity in a leaf (PO: maximpur)

Specify the threshold on the maximum impurity in a node. The impurity is calculated with the method selected in the Impurity measure option.

By default this value is zero, so trees grow until a pure node is obtained (if possible with training set data) and no ambiguities remain.
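A minimal from-scratch sketch of this stopping rule, assuming Gini impurity and combining it with the minimum-patterns threshold described earlier (all names are illustrative, not Rulex internals):

```python
import numpy as np

def should_stop(labels, npattnode=1, maximpur=0.0):
    """Stop growing when a node is small enough or pure enough."""
    labels = np.asarray(labels)
    if labels.size < npattnode:
        return True
    p = np.unique(labels, return_counts=True)[1] / labels.size
    impurity = 1.0 - np.sum(p ** 2)  # Gini; swap in entropy or error as configured
    return impurity <= maximpur

print(should_stop(["a"] * 10))      # True: pure node, impurity 0
print(should_stop(["a", "b"] * 5))  # False: impurity 0.5 > 0.0
```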

Method for handling missing data (PO: treeusemissing)

Select the method to be used to handle missing data (a sketch of the first strategy follows the list):

  • Replace with average: missing values are replaced with the value fixed by the user for the corresponding attribute (for example, by means of a Data Manager). If this value is not set, the average computed on the training set is employed.

  • Include in splits: patterns with missing values in the test attribute at a given node are sent to both the sub-nodes deriving from the split.

  • Remove from splits: patterns with missing values in the test attribute are removed from the subsequent nodes.
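To illustrate the "Replace with average" fallback only, the fragment below fills each missing entry with the mean of its attribute computed on the training set; the data are hypothetical:

```python
import numpy as np

X_train = np.array([[1.0, np.nan],
                    [2.0, 4.0],
                    [3.0, 6.0]])

# Fill each missing value with the training-set average of its attribute
col_means = np.nanmean(X_train, axis=0)
X_filled = np.where(np.isnan(X_train), col_means, X_train)
print(X_filled)  # the second column's NaN becomes (4 + 6) / 2 = 5
```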

Select the attribute to split before the value (PO: selattfirst)

If selected, the QUEST method is used to select the best split. According to this approach, the best attribute to split is first selected via a correlation measure, such as an F-test or a chi-square test. After choosing the best attribute, the best value for splitting is selected.
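A loose sketch of the attribute-first idea (not the full QUEST algorithm): rank the attributes with a one-way ANOVA F-test, then search only the winning attribute for a split value. All data and thresholds here are synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Step 1: choose the attribute most correlated with the class (F-test)
f_scores, _ = f_classif(X, y)
best_attr = int(np.argmax(f_scores))

# Step 2: only then search that attribute for the best split value,
# here the midpoint threshold minimizing misclassification error
def split_error(t):
    left, right = y[X[:, best_attr] <= t], y[X[:, best_attr] > t]
    mis = lambda s: len(s) - np.bincount(s).max() if len(s) else 0
    return mis(left) + mis(right)

values = np.unique(X[:, best_attr])
thresholds = (values[:-1] + values[1:]) / 2  # candidate midpoints
best_t = min(thresholds, key=split_error)
print(best_attr, round(best_t, 3))
```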

Aggregate data before processing (PO: aggregate)

If selected, identical patterns are aggregated and considered as a single pattern during the training phase.
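Conceptually this is a group-by over all columns, with the multiplicity kept as a weight. A toy pandas sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 25, 40, 25],
    "sex":    ["M", "M", "F", "M"],
    "income": ["<50K", "<50K", ">50K", "<50K"],
})

# Identical patterns collapse to a single row; the 'size' column records
# the multiplicity, which a trainer could use as a sample weight.
agg = df.groupby(list(df.columns), as_index=False).size()
print(agg)
```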

Initialize random generator with seed (PO: initrandom, iseed)

If selected, a seed, which defines the starting point in the sequence, is used during random generation operations. Consequently, using the same seed each time makes each execution reproducible. Otherwise, each execution of the same task (with the same options) may produce dissimilar results, due to different random numbers being generated in some phases of the process.
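The reproducibility point is general; a two-line NumPy illustration:

```python
import numpy as np

# Same seed, same sequence: repeated runs give identical results
a = np.random.default_rng(42).random(5)
b = np.random.default_rng(42).random(5)
assert (a == b).all()
```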

Append results (PO: append)

If selected, the results of this computation are appended to the dataset, otherwise they replace the results of previous computations.

 

Results

The results of the Decision Tree task can be viewed in two separate tabs:

  • The Monitor tab, where it is possible to view the statistics related to the generated rules as a set of histograms, such as the number of conditions, covering value, or error value. Rules relative to different classes are displayed as bars of a specific color. These plots can be viewed during and after computation operations. 

  • The Results tab, where statistics on the DT computation are displayed, such as the execution time, number of rules etc.

 

Example

The following example is based on the Adult dataset.

 

Scenario data can be found in the Datasets folder in your Rulex installation.

 

The scenario aims to solve a simple classification problem based on income ranges.

 

 

The following steps were performed (a scripted analogue is sketched after the list):

  1. Import the Adult dataset with an Import from Text File task.

  2. Split the dataset into training, validation and test sets with a Split Data task.

  3. Generate rules from the dataset using the Decision Tree task. 

  4. Analyze the generated rules with a Rule Manager task.

  5. Apply the rules to the dataset with an Apply Model task.

  6. View the results of the forecast via the Take a look function.
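For readers who want to reproduce a comparable flow outside Rulex, here is a rough scikit-learn analogue of the steps above. The file name adult.csv, the column names and all parameter values are assumptions, not part of the scenario:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical local copy of the Adult dataset with an "income" column
df = pd.read_csv("adult.csv")
X = pd.get_dummies(df.drop(columns="income"))  # one-hot encode categoricals
y = df["income"]

# 60% training, 20% validation, 20% test, as in the Split Data step
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```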

 

Procedure

After importing the adult dataset with the Import from Text File task and splitting it into training, validation and test sets (60% training, 20% validation, 20% test) with the Split Data task, add a Decision Tree task to the process and double click the task.

Select:

  • Cost-complexity as pruning method

  • 0.5 as Maximum impurity in a leaf

  • Income as the output attribute

Compute the task to start the analysis.

The properties of the generated rules can be viewed in the Monitor tab of the Decision Tree task: 

There are, for example, 117 rules with 2 conditions: 78 relative to class "<50K" and 39 relative to class ">50K".

The total number of rules and the minimum, maximum and average number of conditions are also reported.

Analogous histograms can be viewed for covering and error, by clicking on the corresponding tabs.



Clicking on the Results tab displays a spreadsheet with:

  • the execution time (only for the DT task),

  • some input data properties, such as the number of patterns and attributes,

  • some results of the computation, such as the number of rules and rule statistics.

The forecast ability of the set of generated rules can be viewed by adding an Apply Model task to the Decision Tree task, and computing with default options.

If required, here we could apply weights to the execution, for example if we were more interested in identifying one of the two classes.

The rule spreadsheet can be viewed by adding a Rule Manager task.

Each row displays all the conditions that belong to the specific rule.

The total number of generated rules is 1512, with a number of conditions ranging from 1 to 9.

The maximum covering value is 67.2%, whereas the maximum error is about 14.9%.

We can check out the application of this set of rules to the training and test patterns by right-clicking the Apply Model task and selecting Take a look.

The application of the rules generated by the Decision Tree task has added new columns containing:

  • the forecast for each pattern: pred(income)

  • the confidence relative to this forecast: conf(income)

  • the number of rules used by each pattern: rule(income)

  • the most important rule that determined the prediction: nrule(income)

  • the classification error, i.e. 1 if misclassified and 0 if correctly classified: err(income).

 



Selecting Test Set from the Displayed data drop-down list shows how the rules behave on new data.

In the test set, the percentage of accuracy is about 81.4%.

Post-processing model optimization can potentially improve test set accuracy at the expense of a slightly higher error level on the training set.


