Using Sequence Analysis to Solve Association Problems

Rulex extracts frequent sequences from event logs with the Sequence Analysis task.

Prerequisites

a Rulex process has been created
a dataset containing an event log has been imported into the process
the data used for the model has been well prepared

Additional tabs

The following additional tabs are provided:

Documentation tab, where you can document your task.
Parametric options tab, where you can configure process variables instead of fixed values. Parametric equivalents are expressed in italics on this page (PO).
Frequent sequences and Results tabs, where you can see the output of the task computation. See the Results section below.

Procedure

Drag and drop the Sequence Analysis task onto the stage.
Connect a task which contains event log data to the Sequence Analysis task.
Double click the Sequence Analysis task. The left-side pane displays a list of all the available attributes in the dataset, which can be ordered and searched as required.
Configure the basic and advanced options as described in the table below.
Save and compute the task.

Sequence Analysis Basic options

Parameter Name	PO	Description
Minimum event support (#samples)	supth	All events which appear in sequences fewer times than this threshold are discarded. This value is relevant only if the Auto (specify #events) option is not selected.
Auto (specify #events)	mbaspecnitem	If this option is selected, the minimum support for events is automatically computed: the user shall specify the number of events to take into account (most frequent first).
#Events to consider	mbanitemsup	Number of events to take into account (most frequent first). This value is relevant only if the Auto (specify #events) option is selected.
Minimum sequence support (#samples)	assupth	All sequences which are verified fewer times than this threshold are discarded. This value is relevant only if the Auto (above average) option is not selected.
Auto (above average)	abavassupth	If this option is selected, the minimum sequence support is set to the average support of sequences with the same dimension (i.e. constituted by the same number of events).
Maximum sequence cardinality	fitmaxdim	Maximum cardinality of generated sequences.
No maximum sequence cardinality	fitnomaxdim	If this option is selected, all sequences with higher support than the specified threshold are generated, regardless of their cardinality.
Time attribute	seqname, timeunit	Attribute including the timestamp for each of the events. The reference time unit can also be specified via the drop-down menu.
Minimum and maximum interval between sequence elements	sanminintseq, sanmaxintseq	Consecutive events in sequences are bound to these minimum and maximum thresholds of temporal distance.
Allow repetitions (the same event can occur more than one time in a sequence)	sanallrep	If this option is selected, repetitions of the same event in a single sequence are allowed.
Only print cyclic sequences (start event and end event have the same ID)	sanonlycic	If this option is selected, the output is constituted only by the sequences in which the first event is characterized by the same ID as the last one.
Sequence ID attributes (NOMINAL)	mbaorderkeynames	Drag and drop here the nominal attributes which identify the sequences. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.
Event ID attributes (NOMINAL)	mbaitemkeynames	Drag and drop here the nominal attributes which characterize the events. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.
Sequence Analysis Advanced options
Attribute to filter to select relevant data		Drag and drop here the attribute you want to use as a filter to select relevant data, from the Available attributes or Proximity attributes lists and configure the filter in the attribute filter dialog box. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.
Attribute to filter to discard irrelevant data		Drag and drop here the attribute you want to use as a filter to discard irrelevant data, from the Available attributes or Proximity attributes lists and configure the filter in the attribute filter dialog box. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.
Proximity attributes	mbaitemchildnames	Drag and drop here the ordered attributes which, together with time, characterize the proximity among events, and then set the corresponding thresholds in the Minimum-maximum proximity thresholds edit box. For example, to mine frequent sub-sequences of events which occur in locations close to each other, drag the spatial coordinates into this list. Instead of manually dragging and dropping attributes, they can be defined via a filtered list.
Minimum-maximum proximity thresholds		Set the minimum and maximum proximity thresholds for the corresponding attribute in the Proximity attributes edit box.
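
The two ways of selecting events described above (a fixed minimum support, or the Auto mode that keeps the most frequent events) can be illustrated with a minimal Python sketch. This is not the Rulex implementation: the function names, the toy log, and the choice of counting support as the number of distinct sequences containing an event are all assumptions made for illustration.

```python
from collections import Counter


def event_support(log):
    """Count, for each event ID, the number of distinct sequences containing it.

    Assumption: support is counted once per sequence, not once per row.
    """
    seen = {(seq_id, event_id) for seq_id, event_id in log}
    return Counter(event_id for _, event_id in seen)


def select_events(log, min_support=None, top_n=None):
    """Return the set of event IDs kept for mining.

    top_n imitates the 'Auto (specify #events)' mode (most frequent first);
    otherwise events below min_support are discarded.
    """
    support = event_support(log)
    if top_n is not None:
        # Sort by descending support, then by event ID so ties are deterministic.
        ranked = sorted(support.items(), key=lambda kv: (-kv[1], kv[0]))
        return {event for event, _ in ranked[:top_n]}
    return {event for event, count in support.items() if count >= min_support}


# Hypothetical event log: (sequence ID, event ID) pairs.
log = [
    ("s1", "A"), ("s1", "B"), ("s1", "A"),
    ("s2", "A"), ("s2", "C"),
    ("s3", "A"), ("s3", "B"),
]
kept_by_threshold = select_events(log, min_support=2)
kept_by_rank = select_events(log, top_n=1)
```

In this toy log, event C appears in a single sequence, so a threshold of 2 discards it, while keeping only the single most frequent event retains A alone.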

Results

The results of the Sequence Analysis task can be viewed in two separate tabs (the respective columns are described in the table below):

The Frequent sequences tab, where it is possible to view the data resulting from the Sequence Analysis computation:
- Frequent Sequence ID: sequential ID number for the frequent sequence.
- Cardinality: number of events that make up the frequent sequence.
- Couple characterization: Qualitative characterization of the behavior for the sequence of two events A-B. The possible outcomes are:
  - Weak sequence - B is likely to follow A, A is indifferent to B,
  - Strong sequence - B is likely to follow A, A is unlikely to follow B,
  - Complements - B is likely to follow A and vice-versa,
  - Substitutes - B is unlikely to follow A and vice-versa,
  - Independents - B is indifferent to A and vice-versa, or
  - Not enough information to determine.
- #Occurrences: number of times in which the sequence is retrieved in the data.
- Confidences: Ratio (a value between 0 and 1) of cases in which the final part of the sequence follows once the initial part has occurred. The first confidence column refers to the initial event: it measures how often the rest of the sequence follows when the initial event occurs. If a Maximum sequence cardinality higher than 2 is set, further columns are generated, measuring how often the remaining events follow once the first two events have occurred, and so on.
- All-confidence: Ratio between the number of occurrences of the whole sequence and the number of occurrences of the least frequent event included in the sequence.
- Minimum time interval, Maximum time interval, Average time interval, Std time interval: Minimum, maximum, average and standard deviation of the time interval over the occurrences of the frequent sequence.
- Event IDs: IDs of the events constituting the frequent sequence.
The Results tab, where statistics on the task computation are displayed, such as the number of frequent sequences generated:
- Task identifier: ID code for the task, internally used by the Rulex engine.
- Task name: name of the task.
- Elapsed time: time required for latest computation (in seconds).
- Number of different events in input: number of distinct events which were fed to the task during latest computation.
- Number of different sequences in input: number of distinct sequences which were fed to the task during latest computation.
- Number of detected frequent sequences: number of frequent sequences detected by the task.
- Number of generated frequent sequences: number of sequences which were found to be frequent, according to the support threshold.
- Minimum event support: minimum number of occurrences for frequent events.
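
The Confidences and All-confidence columns described above can be reproduced by hand from occurrence counts. The following Python sketch only restates those two definitions; the function names and all counts are hypothetical and are not taken from any Rulex output.

```python
def confidences(prefix_counts, sequence_count):
    """Return one confidence value per prefix of the sequence.

    prefix_counts[0] is the number of occurrences of the initial event,
    prefix_counts[1] that of the first two events, and so on.
    Each confidence is: occurrences of the whole sequence / occurrences of the prefix.
    """
    return [sequence_count / count for count in prefix_counts]


def all_confidence(sequence_count, event_counts):
    """Whole-sequence occurrences divided by those of the least frequent event."""
    return sequence_count / min(event_counts.values())


# Hypothetical counts for a frequent sequence A-B-C.
sequence_count = 10                        # occurrences of A-B-C
prefix_counts = [40, 20]                   # occurrences of A, then of A-B
event_counts = {"A": 40, "B": 25, "C": 50}

conf_columns = confidences(prefix_counts, sequence_count)
all_conf = all_confidence(sequence_count, event_counts)
```

With these counts, the first confidence column is 10/40 and the second is 10/20, while the all-confidence is 10/25 because B is the least frequent event.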

Example

In the example process, frequent sequences are extracted from a dataset with the Sequence Analysis task.

Scenario data can be found in the Datasets folder in your Rulex installation.

The following steps were performed:

First we import the dataset.
The dataset is rearranged in the Reshape To Long task.
Frequent sequences are extracted with the Sequence Analysis task.
The relative results are viewed via the Take a look functionality.
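
To make the rearrangement step more concrete, here is a minimal Python sketch of a wide-to-long reshape. It only illustrates the idea and is not how the Reshape To Long task is implemented; the function name and the sample rows are hypothetical and do not come from the san-test dataset.

```python
def reshape_to_long(wide_rows):
    """Turn rows of (sequence ID, date, event 1, event 2, ...) into
    one (sequence ID, date, event ID) row per event."""
    long_rows = []
    for seq_id, day, *event_ids in wide_rows:
        for event_id in event_ids:
            if event_id:  # skip empty cells left by rows with fewer events
                long_rows.append((seq_id, day, event_id))
    return long_rows


# Hypothetical wide rows with a variable number of events each.
wide = [
    ("S1", "2020-01-01", "E1", "E2", "E3"),
    ("S2", "2020-01-02", "E2", None, None),
]
long_rows = reshape_to_long(wide)
```

Each wide row with N events becomes N long rows, each carrying one Sequence ID/Event ID pair together with the date.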

Procedure	Screenshot
First we import the san-test dataset, retrieving attribute names from line 1 and attribute types from line 2. Each row of the dataset represents a sequence, composed of a Sequence ID, the date of occurrence, and a variable number of Event IDs.
Then we add a Reshape to Long task to the process to rearrange the dataset, so that the information concerning a sequence of N events is distributed over N rows, with each row including a Sequence ID/Event ID pair.
Then we connect the Sequence Analysis task to the Reshape to Long task and configure it as follows: Drag and drop the Sequence ID attribute onto the Sequence ID attributes list and the Wide_1 attribute onto the Event ID attributes list. Select the Auto option (to the right) for the Minimum event support. Set #Events to consider to 30 (if you have problems setting this number, deselect and reselect the Auto option above). Deselect the Auto (above average) option for Minimum sequence support (#samples) and set the value to 10. Set the Maximum sequence cardinality to 2. Select Date as the Time attribute (and day as the unit of measure). Set the Minimum and maximum interval between sequence elements to 0 and 1, respectively.
The extracted frequent sequences can be seen in the Frequent sequences tab.
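
As a rough illustration of what a configuration like the one above asks for (sequences of cardinality 2, a minimum sequence support, and a 0 to 1 day interval between consecutive events), the following Python sketch counts ordered event pairs in a long-format log. It is a simplified sketch under stated assumptions, not the Rulex algorithm: the function name and the toy data are hypothetical, support is counted once per sequence ID, and repetitions of the same event are not allowed.

```python
from collections import defaultdict
from datetime import date


def frequent_pairs(log, min_support, min_gap_days=0, max_gap_days=1):
    """Return {(a, b): support} for ordered pairs a -> b meeting the thresholds."""
    by_sequence = defaultdict(list)
    for seq_id, day, event_id in log:
        by_sequence[seq_id].append((day, event_id))

    containing = defaultdict(set)  # pair -> IDs of the sequences where it occurs
    for seq_id, events in by_sequence.items():
        # Events on the same day are ordered by event ID, an arbitrary tie-break.
        events.sort()
        for i, (day_a, a) in enumerate(events):
            for day_b, b in events[i + 1:]:
                gap = (day_b - day_a).days
                if a != b and min_gap_days <= gap <= max_gap_days:
                    containing[(a, b)].add(seq_id)

    return {pair: len(ids) for pair, ids in containing.items()
            if len(ids) >= min_support}


# Hypothetical long-format log: (sequence ID, date, event ID).
log = [
    ("S1", date(2020, 1, 1), "A"), ("S1", date(2020, 1, 2), "B"),
    ("S2", date(2020, 1, 1), "A"), ("S2", date(2020, 1, 2), "B"),
    ("S2", date(2020, 1, 5), "C"),
    ("S3", date(2020, 1, 1), "A"), ("S3", date(2020, 1, 4), "B"),
]
pairs = frequent_pairs(log, min_support=2)
```

In this toy log only the pair A-B survives: it occurs within one day in S1 and S2, whereas in S3 the two events are three days apart and the pair is not counted.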