Applying a Function to a PySpark DataFrame Column
In this article, we're going to learn how to apply a function to a PySpark DataFrame column.
Apache Spark can be used from Python through the PySpark library. PySpark is an open-source Python library widely used for data analytics and data science. Pandas is powerful for data analysis, but what makes PySpark stand out is its ability to handle big data.
Note: Installation follows the same steps as in the article on installing PySpark, except that Python is installed instead of Scala; the rest of the steps are the same.
Install required module
Run the commands below in a command prompt or terminal to install the pyspark and pandas modules:
pip install pyspark
pip install pandas
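Pandas UDFs move data between Spark and pandas via Apache Arrow, so the pyarrow package must also be available:
pip install pyarrow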
Applying a Function on a PySpark DataFrame Column
Here we will look at how to apply a function to a PySpark DataFrame column. For this purpose, we will make use of pandas_udf(), available in pyspark.sql.functions.
Syntax:
# defining the function; the argument passed to
# pandas_udf is the return type of the UDF
@pandas_udf('return_type')
def function_name(argument: argument_type) -> result_type:
    function_body
# applying the function
DataFrame.select(function_name(specific_DataFrame_column)).show()
Example 1: Adding 's' to every element in a column of the DataFrame
Here, we will apply a function that returns the same elements with an additional 's' appended to each. Let's look at the steps:
- Import the PySpark module.
- Import pandas_udf from pyspark.sql.functions.
- Import pandas, since pandas UDFs operate on pandas Series.
- Initialize the SparkSession.
- Use pandas_udf as the decorator.
- Define the function.
- Create a DataFrame.
- Use the .select method on the DataFrame and, as its argument, pass the function name along with the specific column you want to apply the function to.
Python3
# importing SparkSession to initialize a session
from pyspark.sql import SparkSession
# importing pandas_udf
from pyspark.sql.functions import pandas_udf
# importing Row to create the DataFrame
from pyspark.sql import Row
# importing pandas, whose Series type the UDFs work on
import pandas as pd

# initialising the spark session
spark = SparkSession.builder.getOrCreate()

# creating the DataFrame
df = spark.createDataFrame([
    Row(fruits='apple', quantity=1),
    Row(fruits='banana', quantity=2),
    Row(fruits='orange', quantity=4)
])
# printing our created DataFrame
df.show()
Output:
+------+--------+
|fruits|quantity|
+------+--------+
| apple| 1|
|banana| 2|
|orange| 4|
+------+--------+
Now, let's apply the function to the 'fruits' column of this DataFrame.
Python3
# pandas UDF with the return type 'string'
@pandas_udf('string')
def adding_s(s: pd.Series) -> pd.Series:
    # concatenating 's' onto each element of the Series
    return s + 's'

# applying the above function on the
# 'fruits' column of the 'df' DataFrame
df.select(adding_s('fruits')).show()
Output:
+----------------+
|adding_s(fruits)|
+----------------+
| apples|
| bananas|
| oranges|
+----------------+
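As a side note, the same UDF works anywhere a column expression is accepted. For instance, if you want to keep the original columns and attach the result as a new one, you can use withColumn (the column name 'fruits_s' below is just an illustrative choice):
Python3
# keeping the original columns and adding the
# transformed values under a new column name
df.withColumn('fruits_s', adding_s('fruits')).show()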
Example 2: Capitalizing each element in the 'fruits' column
Here, we will capitalize each element in the 'fruits' column of the same DataFrame from the last example. Let's look at the steps to do that:
- Use the pandas_udf as the decorator.
- Define the function.
- Use the .select method on the DataFrame and, as its argument, pass the function name along with the specific column we want to apply the function to.
Python3
@pandas_udf('string')
def capitalize(s1: pd.Series) -> pd.Series:
    # s1 is a pandas Series, which has no capitalize()
    # method of its own; capitalize() is a string method,
    # so we access it through the .str accessor
    return s1.str.capitalize()

df.select(capitalize('fruits')).show()
Output:
+------------------+
|capitalize(fruits)|
+------------------+
| Apple|
| Banana|
| Orange|
+------------------+
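By default, the result column is named after the UDF call, e.g. capitalize(fruits). If you prefer a cleaner header, you can attach .alias() to the call (the name 'fruits_capitalized' here is just an illustrative choice):
Python3
# renaming the result column with .alias()
df.select(capitalize('fruits').alias('fruits_capitalized')).show()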
Example 3: Square of each element in the 'quantity' column of 'df' DataFrame
Here, we will create a function that returns the square of each number in the 'quantity' column. Let's look at the steps:
- Import Iterator from typing.
- Use pandas_udf() as Decorator.
- Define the Function.
- Use the .select method on the DataFrame and, as its argument, pass the function name along with the specific column you want to apply the function to.
Python3
from typing import Iterator

@pandas_udf('long')
def square(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for x in iterator:
        yield x * x

df.select(square('quantity')).show()
Output:
+----------------+
|square(quantity)|
+----------------+
| 1|
| 4|
| 16|
+----------------+
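A note on this variant: a plain Series-to-Series UDF (as in Examples 1 and 2) would work just as well here. The Iterator[pd.Series] -> Iterator[pd.Series] form is mainly useful when the UDF needs some expensive one-time setup, since code placed before the loop runs once per batch iterator rather than once per batch. A minimal sketch, where the 'exponent' variable stands in for some hypothetical expensive initialisation:
Python3
from typing import Iterator

@pandas_udf('long')
def square_with_setup(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # hypothetical one-off setup; in practice this could be
    # loading a lookup table or a machine learning model
    exponent = 2
    for batch in iterator:
        yield batch ** exponent

df.select(square_with_setup('quantity')).show()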
Example 4: Multiplying each element of the 'quantity' column by 10
We will follow the same steps as above, but change the function slightly.
Python3
from typing import Iterator

@pandas_udf('long')
def multiply_by_10(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for x in iterator:
        # multiplying each element by 10
        yield x * 10

df.select(multiply_by_10('quantity')).show()
Output:
+------------------------+
|multiply_by_10(quantity)|
+------------------------+
| 10|
| 20|
| 40|
+------------------------+
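Finally, nothing stops us from applying several of these UDFs in a single pass over the DataFrame. Assuming the UDFs from the earlier examples are defined in the same session:
Python3
# applying two of the UDFs defined above in one select
df.select(capitalize('fruits'), multiply_by_10('quantity')).show()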