Using Logistic to Solve Classification Problems

The Logistic task solves classification problems according to the logistic regression approach, i.e. by approximating the probabilities associated with the different classes through a logistic function:

where

c is the number of output classes
j is the index of the class, ranging from 1 to c−1,
d is the number of inputs and
𝓌_ji is the weight for class _j and input _i .

The probability for the c -th class is obtained as . The optimal weight matrix 𝓌_ji is retrieved by means of a Maximum Likelihood Estimation (MLE) approach that makes use of the Newton-Raphson procedure to find the minimum of the log-like function.

The output of the task is the weights matrix 𝓌_ji, which can be employed by an Apply Model task to perform the Logistic forecast on a set of examples..

Prerequisites

a Rulex process has been created
the required datasets have been imported into the process
the data used for the model has been well prepared, and contains a categorical output value, and some input values.
a single unified dataset has been created by merging all the datasets imported into the process.

Additional tabs

Along with the Options tab, where the task can be configured, the following additional tabs are provided:

Documentation tab where you can document your task.
Parametric options tab where you can configure process variables instead of fixed values. Parametric equivalents are expressed in italics in this page (PO).
Coefficients and Results tabs, where you can see the output of the task computation. See Results table below.

Procedure

Drag and drop the Logistic task onto the stage.
Connect a task, which contains the attributes from which you want to create the model, to the new task.
Double click the Logistic task.
Drag and drop the input attributes, which will be used for regression, from the Available attributes list on the left to the Selected input attributes list.
Configure the options described in the table below.
Save and compute the task.

Parameter Name	PO	Description
Linear options
Normalization of input variables	normtype	The type of normalization to use when treating ordered (discrete or continuous) variables. Possible methods are: None: no normalization is performed (default) Normal: data are normalized according to the Gaussian distribution, where μ is the average of x and σ is its standard deviation: Minmax [0,1]: data are normalized to be comprised in the range [0,1]: Minmax [-1, 1]: data are normalized to be included in the range [-1, 1]: Every attribute can have its own value for this option, which can be set in the Data Manager task. These choices are preserved if Attribute is selected in the Normalization of input variables option; otherwise any selections made here overwrite previous selections made. Normalization types For further info on possible types see Managing Attribute Properties
P-value confidence (%)	linfitconfpval	Specify the value of the required confidence coefficient.
Regularization parameter	regparam	Specify the value of the regularization parameter which is added to the diagonal of the matrix.
Aggregate data before processing	aggregate	If selected, identical patterns are aggregated and considered as a single pattern during the training phase.
Initialize random generator with seed	initrandom, iseed	If selected, a seed, which defines the starting point in the sequence, is used during random generation operations. Consequently using the same seed each time will make each execution reproducible. Otherwise, each execution of the same task (with same options) may produce dissimilar results due to different random numbers being generated in some phases of the process.
Append results	append	If selected, the results of this computation are appended to the dataset, otherwise they replace the results of previous computations.
Output attribute (response variable)	linfitoutputname	Select the attribute which will be used to identify the output.
Weight attribute	linfitweightname	If specified, this attribute represents the relevance (weight) of each sample (i.e., of each row) .
Selected input attributes (drag from available attributes)	linfitselectednames	Drag and drop the input attributes you want to use to form the model leading to the correction classification of data.

Results

The results of the Logistic task can be viewed in:

the Results tab, where statistics such as the execution time, number of attributes etc. are displayed.
the Coefficients tab, where the coefficients matrix weight vector 𝓌_jirelative to the Logistic approximation is shown. Each row corresponds to an output class whereas columns are relative to a single input attributes.

Example

The following examples are based on the Adult dataset.

The scenario aims to solve a simple regression problem based on the hours per week people statistically work, according to such factors as their age, occupation and marital status.

The following steps were performed:

First we import the adult dataset with an Import from Text File task.
Split the dataset into a test and training set with a Split Data task.
Generate the model from the dataset with the Logistic task.
Apply the model to the dataset with an Apply Model task, to forecast the output associated with each pattern of the dataset.
Use the Take a look functionality to view the results.

Procedure	Screenshot

Procedure	Screenshot
After importing the adult dataset with the Import from Text File task and splitting the dataset into test (30% of dataset) and training (70% of dataset) sets with the Split Data task, add a Logistic task to the process and double click the task.
Configure the task as follows: Normalization for input variables: None Output attribute: income Drag and drop all remaining attributes onto the Selected input attributes list. Leave the remaining options with their default values and compute the task.
Once computation has completed, in the coefficients tab we have a single row containing the coefficients relative to the output class `<=50K`. In a case with c>2 output classes, the weight matrix contains c−1 rows each containing the coefficients relative to an output class. The probability of the last class is obtained as .
The Results tab contains a summary of the computation. Then add an Apply Model task to forecast the output associated with each pattern of the dataset.
To check how the model built by Logistic model has been applied to our dataset, right-click the Apply Model task and select Take a look. Two result columns have been added: The pred(income) column contains the output forecast generated by the Logistic model. The err(income) column contains the error, which corresponds to the difference between the predicted output and the real one. If the actual output is missing, this field is also left empty.