Joining Datasets

The Join task in Rulex merges two datasets by their rows, creating a single table with all the data.

Merge operations can be used to produce a global dataset from on which in-depth analyses can be performed.

All merge operations are performed around the key attributes, which act as a unique identifier of each row in a table.

Rulex offers a full range of join options:

 

 

Prerequisites

Additional tabs

The following additional tabs are provided:


Procedure

  1. Drag and drop the Join task onto the stage.

  2. Connect the tasks that contain the datasets you want to merge to merge the Join task.

  3. Double-click the Join task.

  4. Select the attributes you want to include in the merged table from the list of attributes on the left (lkeepnames) and right side tables (rkeepnames).

  5. Configure the options described in the table below.

  6. Save and compute the task.

Join options

Name

PO

Description

Join type

jointype

Select the combination method you want to use from the Join type drop-down menu. Possible types are:

  • Inner join (0): only the pairs of samples whose values of the key attributes values satisfy all the conditions listed in the matching attributes panels will be included in the final dataset.

  • Left outer join (1): the final dataset includes all the records from the dataset on the left even if the join condition does not find matching records in the right dataset. 

    If a row in the left-hand table does not match any row in the right-hand table the columns relative to the right-hand table will present empty cells.

  • Right outer join (2): the final dataset includes all the records from the dataset on the right. If a row in the right-hand table does not match any row in the left-hand table the columns relative to the left-hand table will present empty cells.

  • Full outer join (3): the union of the tables produced by left outer join and right outer join is included in the final dataset.

  • Left complement (4): the final dataset includes all the records from the left-hand dataset that do not satisfy the conditions listed in the matching attribute panels. Since the rows in the left-hand table do not match any records in the right-hand table, the columns relative to the right-hand table will all be empty.

  • Right complement (5): the final dataset includes all the records from the right-hand dataset that do not satisfy the conditions listed in the matching attribute panels. Since the rows in the right-hand table do not match any records in the left-hand table, the columns relative to the left-hand table will all be empty.

  • Full complement (6): the final dataset includes all the records from both datasets that do not satisfy the conditions listed in the matching attribute panels. 

Missing policy

misspolicy

Select the policy you want to adopt for managing missing values when evaluating the constraints on the matching attributes:

  • Missing values are considered normal values (0): missing values are considered missing, and are matched only with other missing values.

  • Missing values always satisfy equality checks (1): missing values are always considered as matching.

  • Missing values never satisfy equality checks (2): missing values are never considered as matching.

Merge type

mergetype

This option manages attributes with the same name (or the same position according to the Match columns by name options) that are present in both datasets.

Possible merge types are:

  • None (0): if the attributes are not merged, a separate column will be created in the resulting dataset for the left-hand and right-hand dataset attributes.

  • Left fill (1): the value in the right-hand dataset overwrites the value in the left-hand dataset. If there is no value in the right-hand dataset for the attribute, the corresponding value in the left-hand dataset is kept.

  • Right fill (2): the value in the left-hand dataset overwrites the value in the right-hand dataset. If there is no value in the left-hand dataset for the attribute, the corresponding value in the right-hand dataset is kept.

Match columns by name

byname

If selected, columns with the same name are considered equal (default), rather than in the same position.

Matching attributes for joining datasets

lkeynamesrkeynames, oplist, parlist

Drag and drop the key attributes you want to use to combine the datasets from the left-hand dataset (lkeynames) and right-hand dataset (rkeynames) onto the Matching attributes panel below, specifying the relationship that must exist between each pair of attributes with an operator (oplist) and value (parlist). If no attributes are selected row numbers are employed to combine datasets. A search and order function is provided to make it easier to find attributes when the list is long:

  • Search attribute, which allows you to search the list of attributes for a specific string.

  • Order attributes by, which allows you to sort attributes in the original order, in alphabetical order or according to their type, ignored value, number of values and role.

More than one attribute can be defined as key, however the number of attributes must be the same in both panels.

 

Insert left key value when missing or Insert right key value when missing

leftfill,

rightfill



Select Insert left key value when missing or Insert right key value when missing if you want all the key attributes that are not present in a dataset to be filled with the corresponding key value in the other dataset.

 

Example

The following examples are based on the Northwind dataset.

 

Scenario data can be found in the Datasets folder in your Rulex installation.

 

The following steps were performed:

  1. We import the northwind_customers.set dataset.

  2. We then import the northwind_orders.set dataset.

  3. A Join task is added to create a single dataset.

  4. Use the Take a look functionality to check the joined datasets.

 

Procedure

Screenshot

Procedure

Screenshot

Import the first northwind_customers.set dataset, and compute the task.

To check its data, right-click the task and select Take a look.

The dataset is related to customers and contains:

  • 91 records

  • 11 nominal attributes.



northwind_orders is related to orders and contains

  • 830 records

  • 14 attributes of various data types.



Add a Join task to the process. If the task is executed with the default options no attributes are present in the matching panels and therefore the row numbers are employed to combine datasets.

In this case the resulting table has 91 records with 25 attributes.

Note that the name CustomerID of the first attribute deriving from the right dataset has been changed to Customer_ID_1 to avoid a conflict with the name of the first attribute.

If the Right outer join is selected as the Join type, the resulting dataset has 830 rows (same as the right table) and all the attributes from the left dataset are filled with missing values from rows 92 on.

If the CustomerID and City attributes from the dataset on the left are matched with the CustomerID and ShipCity attributes from the dataset on the right, and Inner join is chosen as the Join type, the resulting dataset includes 817 records with 25 attributes.

Only 817 different values in the CustomerID/ShipCity attribute pair from the right dataset match the values in the CustomerID and City attribute pair from the left dataset.

If the Full complement is selected as the Join type for task join1, the resulting dataset includes 16 records.



Need to get in touch? https://www.rulex.ai/contact/ - Need a license renewal? https://rulex.atlassian.net/servicedesk/customer/user/login?destination=portals
© 2024 Rulex, Inc. All rights reserved.