Identifying the Principal Components in Datasets with PCA

The Principal Component Analysis task identifies the most important components in a dataset and consequently reduces the number of attributes the dataset contains. These components are linear combinations of attributes that capture most of the variance in the values. In practice, Principal Component Analysis (PCA) compresses a large amount of data into a smaller number of attributes that capture the essence of the original data. To put it simply, think of how a TV shows 3D people and places flattened into a 2D picture. Although a dimension is missing, we don’t lose much detail.

The first new “attribute” (called an eigenvector) captures the largest amount of variation in the data, the second eigenvector captures the second largest amount of variation, and so on. In the Principal Component Analysis task in Rulex you can select how many eigenvectors you want to create in your new compressed dataset.
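The ordering of eigenvectors by variance described above can be illustrated with a minimal NumPy sketch. This is not the Rulex implementation: the `pca` function, the synthetic data and all variable names are assumptions made for illustration only.

```python
import numpy as np

def pca(data, n_components):
    """Illustrative PCA: project data onto its top n_components eigenvectors."""
    # Center each attribute (column) on its mean.
    centered = data - data.mean(axis=0)
    # Covariance matrix between attributes (columns are attributes).
    cov = np.cov(centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Reorder so the eigenvector with the largest variance comes first.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Share of the total variance captured by each eigenvector.
    explained = eigenvalues / eigenvalues.sum()
    # New compressed dataset: one column per retained eigenvector.
    projected = centered @ eigenvectors[:, :n_components]
    return projected, explained

# Synthetic example (hypothetical data): three attributes, two of them strongly correlated.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.5, size=200),
])

projected, explained = pca(data, n_components=2)
print(projected.shape)  # (200, 2)
print(explained)        # variance shares, largest first
```

Because two of the three synthetic attributes are almost perfectly correlated, the first eigenvector alone accounts for most of the variance, which is why dropping the last one loses little information.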

This function is extremely useful when dealing with datasets that have a very high number of attributes, for example to prepare large datasets for tasks such as clustering, neural networks and linear fit. The technique can also help to avoid overfitting in rules, which occurs when there are so many attributes that rules become overly precise and articulate on the training set, and consequently do not produce good results when applied to new data.

However, an eigenvector does not represent a single aspect of the original dataset, such as age or occupation, but a combination of several. Consequently, rules generated from the task's output are not immediately understandable or explainable to humans. It is possible to analyze how much each original attribute influenced the eigenvectors in the rules, but this method is rather approximate and not particularly reliable. It would not make much sense, for example, to use this task with the Logic Learning Machine algorithm in Rulex.

Consequently, if you need to explain decisions, avoid using the Principal Component Analysis task in your workflow.

Prerequisites

a Rulex process has been created
the required datasets have been imported into the process.

The following additional tabs are provided:

Documentation tab, where you can document your task.
Parametric options tab, where you can configure process variables instead of fixed values. Parametric equivalents are expressed in italics on this page (PO).

Procedure

Drag and drop the Principal Component Analysis task onto the stage.
Connect a task that contains the data you want to analyze to the new task.
Double click the Principal Component Analysis task.
Configure the task options as described in the table below.
Save and compute the task.

Parameter Name	PO	Description
Use previous eigenvectors for Principal Component Analysis execution	useprevious	If selected, the eigenvectors defined in the upstream PCA task will be used to create the required number of principal components.
Method for distance evaluation	distmethod	The method you want to use to compute distances between samples. The distance is computed as the combination of the distances for each attribute. Possible options are: Euclidean, Euclidean (normalized), Manhattan, Manhattan (normalized) and Pearson. See https://rulex.atlassian.net/wiki/spaces/DP/pages/1529251436 for details on these options.
Normalization	normtype	The type of normalization you want to use with ordered variables. Possible options are: None, Attribute, Normal, Minmax [0,1] and Minmax [-1,1]. See https://rulex.atlassian.net/wiki/spaces/DP/pages/1529251436 for details on these options.
Aggregate data before processing	aggregate	If selected, identical patterns will be aggregated and considered as a single pattern during the principal component analysis.
Minimum number of final components (0 means no minimum)	pcanred	The minimum number of final components the resulting dataset must contain. If this number of components does not reach the confidence level specified in the Minimum level of confidence for the resulting dataset option, further components are added until that confidence level is reached.
Minimum level of confidence for the resulting dataset (0 means no minimum)	pcaloc	The minimum level of confidence the resulting dataset must have. If the components needed to reach this confidence level are fewer than the number specified in the Minimum number of final components option, further components are added until that number is reached, so the resulting confidence level may be higher than the minimum.
Attributes for principal component analysis	pcanames	Drag and drop here those attributes which you want to use in the principal component analysis. Principal Component Analysis cannot be performed on nominal values.
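
The interaction between the minimum number of final components and the minimum level of confidence can be sketched as follows. This is only an illustration under a strong assumption: "confidence" is interpreted here as the cumulative share of variance captured by the retained components, which this page does not state. The function `choose_component_count` and its arguments are hypothetical and are not part of Rulex.

```python
import numpy as np

def choose_component_count(explained_ratio, min_components=0, min_confidence=0.0):
    """Return how many components to keep so that both minimums are satisfied.

    ASSUMPTION: 'confidence' is modeled as cumulative explained variance.
    explained_ratio must be sorted from largest to smallest share.
    """
    if min_components == 0 and min_confidence == 0:
        # The behaviour with both minimums disabled is not modeled in this sketch.
        raise ValueError("set at least one of min_components or min_confidence")
    cumulative = np.cumsum(explained_ratio)
    if min_confidence > 0:
        # Smallest count whose cumulative share reaches min_confidence
        # (small tolerance guards against floating-point rounding).
        needed = int(np.searchsorted(cumulative, min_confidence - 1e-12)) + 1
    else:
        needed = 0
    # Whichever constraint asks for more components wins.
    count = max(min_components, needed)
    # Never return more components than exist.
    return min(count, len(explained_ratio))

# Hypothetical variance shares for four components.
shares = [0.6, 0.25, 0.1, 0.05]

# One component is requested, but 90% confidence needs three.
print(choose_component_count(shares, min_components=1, min_confidence=0.9))  # 3

# 50% confidence needs only one component, but three are requested.
print(choose_component_count(shares, min_components=3, min_confidence=0.5))  # 3
```

In both calls the stricter of the two options determines the result, which mirrors the descriptions of pcanred and pcaloc in the table above.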