Feature Engineering in PySpark — Part I

The most commonly used data pre-processing techniques in Spark are as follows:

1) VectorAssembler

2) Bucketing

3) Scaling and normalization

4) Working with categorical features

5) Text data transformers

6) Feature Manipulation

7) PCA

Please find the complete Jupyter notebook here

This blog is mostly intended for those who are new to Spark programming, especially PySpark. Therefore, it follows an illustrative approach to explain the commonly used data-preprocessing techniques.

But before we do anything, we need to initialize a Spark session. This can be done as follows

Now let us download some sample datasets to play with. The datasets I have used in the illustrations can be downloaded from here

For the purpose of demonstration we will be using four different datasets

1) retail-data/by-day

2) simple-ml-integers

3) simple-ml

4) simple-ml-scaling

Let us begin by reading “retail-data/by-day”, which is in .csv format, and save it into a Spark dataframe named ‘sales’.

Let us read the parquet files in “simple-ml-integers” and make a Spark dataframe named ‘fakeIntDF’.

Let us read the parquet files in “simple-ml” and make a Spark dataframe named ‘simpleDF’.

Finally, let us read the parquet files in “simple-ml-scaling” and make a Spark dataframe named ‘scaleDF’.
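The four reads can be sketched as follows. The `data/...` paths are assumptions; point them at wherever you unpacked the downloaded datasets.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-demo").getOrCreate()

# "retail-data/by-day" is a folder of CSV files with a header row
sales = (spark.read.format("csv")
         .option("header", "true")
         .option("inferSchema", "true")
         .load("data/retail-data/by-day/*.csv"))

# The remaining three datasets are stored as parquet
fakeIntDF = spark.read.parquet("data/simple-ml-integers")
simpleDF = spark.read.parquet("data/simple-ml")
scaleDF = spark.read.parquet("data/simple-ml-scaling")
```

`inferSchema` makes Spark scan the CSV data once to guess column types; without it every column is read as a string.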


VectorAssembler

The VectorAssembler is used to concatenate all the features into a single vector, which can then be passed to an estimator or ML algorithm. To demonstrate the VectorAssembler we will use the ‘fakeIntDF’ dataframe created in the previous steps.

Let us see what columns we have in ‘fakeIntDF’ dataframe

So we have three columns, int1, int2, int3 with integer values within them.

First, import the VectorAssembler from pyspark.ml.feature. Once the VectorAssembler is imported, we need to create an object of the class; here I will create one named “assembler”. The result above shows that we have three features in ‘fakeIntDF’, i.e. int1, int2, int3, so let us configure the assembler to combine the three features into a single column named “features”. Finally, let us use the transform method to transform our dataset.

The output is as follows

The above result shows that a new column named “features” has been created, which combines “int1”, “int2”, and “int3” into a single vector.



Bucketing

Bucketing is the most straightforward approach for converting continuous variables into categorical ones. In PySpark, bucketing can be easily accomplished using the Bucketizer class.

Firstly, we need to define the bucket borders. Let us define a list bucketBorders = [-1.0, 5.0, 10.0, 250.0, 600.0]

Next, let us create an object of the Bucketizer class, and then apply its transform method to our target dataframe “dataframe”.

Let us create a simple dataframe for demo purposes

Then we can import the Bucketizer class, create an object of this class, and apply the transform method to the dataframe we just created.

The output looks as follows:

Scaling and normalization

Scaling and normalization is another common task that we come across while handling continuous variables. It is not always imperative, but it is highly recommended to scale and normalize the features before applying an ML algorithm, so that features with larger ranges do not unduly dominate the others.

Spark ML provides us with the “StandardScaler” class for easy scaling and normalization of features.

Let us use the scaleDF dataframe that we had created previously to demonstrate the StandardScaler in pyspark.

The output looks as follows:


The StandardScaler standardizes the features to unit standard deviation (and, optionally, zero mean). Sometimes, we come across situations where we need to scale values within a given range (i.e. between a max and a min). For such tasks Spark ML provides a class named MinMaxScaler.

The StandardScaler and MinMaxScaler are used in much the same way; the only difference is that with MinMaxScaler we can provide the minimum and maximum values within which we wish to scale the features.

For the sake of illustration, let us scale the features in the range 5 to 10.

The output looks like this:


Sometimes we need to scale features between -1 and 1. The MaxAbsScaler does exactly this by dividing each feature by its maximum absolute value.

The output looks like this:

So it is observed that all the features are scaled between -1 and 1.


What differentiates the ElementwiseProduct from the previously mentioned scalers is the fact that its features are scaled by a user-supplied multiplying factor.

The code snippet below scales feature #1 by 10, feature #2 by 0.1, and feature #3 by -1.

For example, the features [10, 20, 30], scaled by [10, 0.1, -1], become [100, 2.0, -30].

The output looks like this:

So it is observed that each feature is multiplied by its corresponding scaling factor.


The Normalizer scales each feature vector to have unit norm. The most commonly used norms are the “Manhattan norm” and the “Euclidean norm”. The Normalizer takes a parameter “p” from the user, which specifies the power norm.

For example, p = 1 indicates L1 norm; p = 2 indicates L2 norm.

In fact, p can also be infinity.

The output looks as follows:


StringIndexer (Converting strings to numerical values)

Most ML algorithms require categorical features to be converted into numerical ones.

Spark's StringIndexer maps strings to different numerical values. We will use the simpleDF dataframe for the demo, which contains a categorical feature named “lab”.

Let us apply string indexer to a categorical variable named “lab” in “simpleDF” DataFrame.

The output looks as follows. It can be seen that a new column is added to the dataframe containing encoded values for “good” and “bad”: 1.0 for good and 0.0 for bad.


Sometimes we come across situations where it is necessary to convert the indexed values back to text. For this, Spark ML provides the “IndexToString” class. To demonstrate “IndexToString”, let us use the “LabelIndexed” column of the “idxRes” dataframe, which was created in the previous code snippet.

The LabelIndexed column maps 1.0 → good and 0.0 → bad. Now let us try to reverse this.

The output looks as follows. It can be seen that a new column named “ReverseIndex” is added to the “idxRes” dataframe, which reverses the values in the “LabelIndexed” column back to text, e.g. 1.0 → good and 0.0 → bad.

Indexing within Vectors

Spark offers yet another class named “VectorIndexer”. The “VectorIndexer” identifies categorical variables within a set of features that has already been vectorized and converts them into categorical features with zero-based category indices.

For the purpose of illustration, let us first create a new dataframe named “dataIn” with features in the form of vectors.

The dataframe contains three rows and two columns: 1) three features in the form of vectors, where the first feature is a categorical variable, and 2) labels.

Now let us use the “VectorIndexer” to automatically identify the categorical feature and index it.

There are a few more feature engineering techniques; please refer to my Python notebook at this github link.


In this humble blog I have tried to cover some basic but widely used data preprocessing transformations offered by Spark ML, and I have demonstrated each of them with an illustration. However, there is a plethora of Spark tools to aid the feature engineering task; I will try to cover some of these in my next blog, along with an end-to-end classification model in PySpark.




