How to pass model objects as arguments to a UDF in PySpark

Dhiraj Rai
4 min read · Mar 3, 2024

UDFs are inevitable when we want to do distributed processing of our data using PySpark.

A UDF is basically a Python function that is distributed across multiple executors for processing.

A UDF takes one or more columns from a PySpark DataFrame and transforms them to create a new column based on the logic you have defined in the function.

However, sometimes we come across a situation where we need to pass additional arguments to the UDF beyond the columns of the PySpark DataFrame. These arguments may be strings, lists, numbers, etc. There are numerous examples available on the internet on how to pass such additional arguments to a UDF.

Sometimes these arguments have to be model objects or objects of a class. The objective of this blog post is to demonstrate how we can pass model objects to a UDF.

Prerequisites

  1. Model object: I am assuming that you already have a model object that you need to pass to the UDF, and that within the UDF you want to fetch attributes or call methods on this model object. For the purpose of demonstration, I have an XGBoost model object called “loaded_model”.
  2. Test dataset: We need test data in CSV format; I will apply the UDF to its columns.

Let’s see the sample code.

Let us set up PySpark using the code below.
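A minimal sketch of the setup; the app name is just an illustrative choice.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession; the app name is illustrative.
spark = SparkSession.builder.appName("udf-model-object-demo").getOrCreate()
```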

Importing all the required packages
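Something along the following lines; joblib and pandas are assumed from the rest of the post.

```python
import joblib
import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, col, udf
from pyspark.sql.types import FloatType
```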

Writing a python function that latter needs to be converted to udf
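A sketch of what this function could look like, assuming the model exposes a predict_proba method and that the row arrives as a list of feature values in the order the model was trained on:

```python
def predict_udf(row, loaded_model):
    # Print the row and the model object; this is what shows up in the
    # output and confirms the model reached the executors.
    print(row, loaded_model)
    # Wrap the feature values in a single-row DataFrame and return the
    # predicted probability of the positive class.
    features = pd.DataFrame([row])
    return float(loaded_model.predict_proba(features)[0][1])
```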

As you can see, the function expects two arguments. The first one is the row of the Spark DataFrame, obviously. The second is the additional argument, that is, the XGBoost model object “loaded_model”.

Now, transforming the Python function into a UDF.
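A sketch of the wrapper described below:

```python
def predict_udf2(loaded_model):
    # The lambda closes over loaded_model and forwards it to predict_udf,
    # so Spark only ever sees a one-argument function of the column value.
    return udf(lambda row: predict_udf(row, loaded_model), FloatType())
```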

As you can see, I have used another function named “predict_udf2” as a wrapper around the “predict_udf” function. The wrapper function takes our “loaded_model” object as input and passes it to “predict_udf”. Also, I have used a lambda function to do some jugglery.

Reading the test.csv file into a PySpark DataFrame.
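Roughly as follows (the header and schema options are my assumptions):

```python
# Read the test data; infer column types from the file.
holdout_df = spark.read.csv("test.csv", header=True, inferSchema=True)
```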

Loading the model (creating the “loaded_model” object).

I already have a model.joblib file stored somewhere on my local machine. I am using joblib to load the model and create the “loaded_model” object.
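In code, with an illustrative path:

```python
# Adjust the path to wherever your model.joblib actually lives.
loaded_model = joblib.load("model.joblib")
```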

Applying the UDF to the PySpark DataFrame.
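A sketch under the assumptions above; the pass-through column name is illustrative, and the three names used here are explained right after the code.

```python
pass_thru_cols = ["id"]  # illustrative; any columns you want to keep
attr_list = [c for c in holdout_df.columns if c not in pass_thru_cols]

# Bundle the feature columns into one array column so the UDF receives
# them as a single list-like `row` argument; casting to double keeps the
# array elements a uniform type.
scored_df = holdout_df.select(
    *pass_thru_cols,
    predict_udf2(loaded_model)(
        array(*[col(c).cast("double") for c in attr_list])
    ).alias("pred_prob"),
)
scored_df.show()
```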

pass_thru_cols: the columns from the old DataFrame that you want to retain in the new DataFrame (can be any columns)

attr_list: This is the list of columns from the PySpark DataFrame that our UDF needs for the transformation. In our case, these are the features that the “loaded_model” needs for prediction inside our UDF. It can be all the columns of holdout_df.

pred_prob: This is the new column that will be created from the output of our UDF.

The main point is how I have passed “loaded_model”, which is an XGBoost object, to the UDF.

The same thing can be achieved using “withColumn” as well:
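For example:

```python
# Equivalent to the select() version above; the same assumptions apply.
scored_df2 = holdout_df.withColumn(
    "pred_prob",
    predict_udf2(loaded_model)(
        array(*[col(c).cast("double") for c in attr_list])
    ),
)
```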

Output:

You will see the rows and the XGBoost model, i.e. the “loaded_model” object, printed in the output.
