Data Transformation Tools

Raw data is often not in an ideal format for applying standard machine learning algorithms. For example, when predicting purchase amounts based on various attributes, including gender, the categorical attribute ‘gender’ must be converted into numeric values through a process called dichotomization. This involves creating two new variables: ‘Gender = Male’ and ‘Gender = Female,’ each taking numeric values of 0 or 1.

In other cases, numeric data must be converted into categorical or nominal attributes. For instance, to use logistic regression to predict whether the market price of a home will exceed a certain threshold, the numeric price must first be converted into a binominal (two-valued) variable.

Transforming data types is a common and essential step in data preparation. The most commonly used data type conversion operators include:


  • Numerical to Binominal: Converts numeric attributes to binary types, where binominal attributes can have only two possible values: true or false. For example, if the threshold market price is set at $30,000, prices from $0 to $30,000 are mapped to false, and prices above $30,000 are mapped to true.
  • Nominal to Binominal: Converts a nominal attribute with multiple values into separate binominal attributes. For instance, a nominal attribute ‘Outlook’ with values ‘sunny,’ ‘overcast,’ and ‘rain’ becomes three binominal attributes: ‘Outlook = sunny,’ ‘Outlook = overcast,’ and ‘Outlook = rain,’ each of which can be true or false.
  • Nominal to Numerical: Similar to the Nominal to Binominal operator but produces numeric outputs. Using the ‘Dummy coding’ option, nominal values are replaced with 0/1 (binary) values. The ‘unique integers’ option assigns each nominal value a unique integer starting from 0.
  • Numerical to Polynominal: Changes the type and internal representation of selected attributes so each unique numeric value becomes a possible value for the polynominal attribute. For example, a ‘Temperature’ attribute with unique values ranging from 64 to 85 would be transformed into unique nominal values.
  • Discretization: When converting numeric attributes to polynominal, discretization groups numeric values into bins to avoid creating an excessive number of unique nominal values. For example, the ‘Temperature’ attribute can be discretized into equal-sized bins or user-defined ranges.
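
As a rough illustration outside RapidMiner, the conversions listed above can be approximated in Python with pandas. This is a minimal sketch rather than the operators' actual implementation; the DataFrame, its column names, the sample values, and the bin labels are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical example data; names and values are illustrative only.
df = pd.DataFrame({
    "Outlook": ["sunny", "overcast", "rain", "sunny"],
    "Temperature": [64, 72, 85, 70],
    "Price": [25000, 30000, 42000, 31000],
})

# Numerical to Binominal: prices up to the threshold map to False, above it to True.
threshold = 30000
df["Price_above_threshold"] = df["Price"] > threshold

# Nominal to Binominal: one true/false column per nominal value.
outlook_flags = pd.get_dummies(
    df["Outlook"], prefix="Outlook", prefix_sep=" = "
).astype(bool)

# Nominal to Numerical, dummy coding: the same columns as 0/1 integers.
outlook_dummy = outlook_flags.astype(int)

# Nominal to Numerical, unique integers: each distinct value gets an integer from 0.
outlook_codes, outlook_values = pd.factorize(df["Outlook"])

# Numerical to Polynominal: every unique number becomes its own nominal value.
temperature_nominal = df["Temperature"].astype(str).astype("category")

# Discretization: group the numbers into three equal-width bins instead.
temperature_bins = pd.cut(
    df["Temperature"], bins=3, labels=["low", "medium", "high"]
)

print(outlook_flags)
print(temperature_bins)
```

With only four rows the difference is small, but on a real attribute the discretized version yields three nominal values, while the direct conversion yields one per distinct temperature.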


Sometimes, a dataset may need to be transformed or rotated around an attribute, a process known as pivoting or creating pivot tables. For example, a table with attributes for customer ID, product ID, and a numeric measure like Consumer Price Index (CPI) can be rearranged using the Pivot operator. This creates columns corresponding to product IDs, with CPI data aggregated by customer IDs.

The De-pivot operator reverses this process, converting a pivot table back into a relational structure.
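
The pivot and de-pivot steps can be sketched with pandas as a stand-in for the RapidMiner operators. The column names, the sample values, and the choice of the mean as the aggregation function are assumptions for illustration.

```python
import pandas as pd

# Hypothetical relational (long) table: one row per customer/product observation.
long_df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "product_id": ["A", "B", "A", "A", "B"],
    "measure": [10.0, 20.0, 30.0, 50.0, 40.0],
})

# Pivot: one row per customer, one column per product, measure averaged per cell.
pivoted = long_df.pivot_table(
    index="customer_id", columns="product_id", values="measure", aggfunc="mean"
)

# De-pivot: turn the product columns back into rows.
depivoted = pivoted.reset_index().melt(
    id_vars="customer_id", var_name="product_id", value_name="measure"
)

print(pivoted)
print(depivoted)
```

Note that de-pivoting returns the aggregated values, not the original rows: customer 2 had two observations for product A, and only their mean survives the round trip.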

The Append operator adds new examples (rows) to an existing dataset, provided they match the attributes of the main dataset. The Join operator combines two datasets with the same observation units but different attributes, offering traditional inner, outer, left, and right join options.
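
A minimal pandas sketch of the same two ideas follows. The tables, column names, and the explicit attribute check are assumptions for the example, not a description of how RapidMiner implements Append or Join.

```python
import pandas as pd

# Hypothetical main dataset and new examples with the same attributes.
main = pd.DataFrame({"customer_id": [1, 2], "age": [34, 51]})
new_rows = pd.DataFrame({"customer_id": [3], "age": [27]})

# Append: only add rows when the attributes match the main dataset.
if list(new_rows.columns) != list(main.columns):
    raise ValueError("new examples do not match the attributes of the main dataset")
appended = pd.concat([main, new_rows], ignore_index=True)

# Join: same observation unit (customer), different attributes.
purchases = pd.DataFrame({
    "customer_id": [2, 3, 4],
    "amount": [120.0, 80.0, 45.0],
})

inner = appended.merge(purchases, on="customer_id", how="inner")  # ids in both
left = appended.merge(purchases, on="customer_id", how="left")    # all ids from appended
right = appended.merge(purchases, on="customer_id", how="right")  # all ids from purchases
outer = appended.merge(purchases, on="customer_id", how="outer")  # ids from either

print(outer)
```

Rows without a match in the other table receive missing values for the attributes that table would have supplied, which is one common way missing values enter a dataset.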

Other commonly used operators include Rename attributes, Select attributes, Filter examples, Add attributes, and Attribute weighting.

Sampling and Missing Value Tools

Sampling remains relevant even in the era of big data, particularly when dealing with imbalanced class representations. For example, in fraud prediction, where fraudulent examples might account for only 1-3% of all data, models trained on such imbalanced data often perform poorly on the minority class.

To address this, datasets can be balanced by undersampling the majority class, oversampling the minority class, or both. RapidMiner provides several built-in processes for sampling, including bootstrapping, stratified sampling, model-based sampling, and Kennard-Stone sampling. Bootstrapping repeatedly samples from a base dataset with replacement, so the same example can appear more than once in a sample.
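
The balancing and bootstrapping ideas can be sketched in pandas. The dataset, the 3% fraud rate, the target of 30 examples per class, and the fixed random seed are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 3 fraudulent examples out of 100.
data = pd.DataFrame({
    "amount": range(100),
    "fraud": [1] * 3 + [0] * 97,
})

majority = data[data["fraud"] == 0]
minority = data[data["fraud"] == 1]

# Undersample the majority class (without replacement) and
# oversample the minority class (with replacement) to 30 examples each.
majority_down = majority.sample(n=30, random_state=0)
minority_up = minority.sample(n=30, replace=True, random_state=0)
balanced = pd.concat([majority_down, minority_up], ignore_index=True)

# Bootstrapping: draw as many examples as the base dataset has, with replacement.
bootstrap = data.sample(n=len(data), replace=True, random_state=0)

# Stratified sampling: sample the same fraction from each class,
# so the class proportions are roughly preserved.
stratified = data.groupby("fraud").sample(frac=0.4, random_state=0)

print(balanced["fraud"].value_counts())
print(bootstrap.index.duplicated().sum(), "repeated examples in the bootstrap sample")
```

Oversampling with replacement only repeats the existing minority examples, so resampling should be applied to the training data only, with evaluation done on untouched data.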

For handling missing values, RapidMiner offers the Replace Missing Values operator, which replaces each missing value with the attribute's minimum, maximum, or average, with zero, or with a user-specified value. The Impute Missing Values operator provides a more advanced approach: it treats the attribute with missing values as a label (target variable) and predicts the missing values from the other attributes.
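
Both approaches can be illustrated with pandas and NumPy. This is a sketch under stated assumptions: the data are invented, and a simple straight-line fit stands in for whatever learner would be used to impute values in practice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price.
df = pd.DataFrame({
    "sqft": [1000.0, 1500.0, 2000.0, 2500.0, 4000.0],
    "price": [200.0, 300.0, np.nan, 500.0, 800.0],
})

# Simple replacement: fill with a fixed statistic or constant.
filled_avg = df["price"].fillna(df["price"].mean())
filled_min = df["price"].fillna(df["price"].min())
filled_zero = df["price"].fillna(0.0)

# Imputation: treat 'price' as the label, fit a model on the complete rows,
# then predict the label for the rows where it is missing.
known = df[df["price"].notna()]
missing = df[df["price"].isna()]
slope, intercept = np.polyfit(known["sqft"], known["price"], deg=1)

imputed = df["price"].copy()
imputed.loc[missing.index] = slope * missing["sqft"] + intercept

print(filled_avg.tolist())
print(imputed.tolist())
```

In this toy data the average fills the gap with 450, while the fitted line uses the 2000 sqft value of the row and predicts roughly 400, which shows how imputation can use information that a single summary statistic ignores.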
