
Slice a PySpark DataFrame into Two Row-wise DataFrames
A PySpark DataFrame is a collection of distributed data that can be processed across different machines and is organized into named columns. The term slice is normally used to represent the partitioning of data. PySpark provides methods such as limit(), collect(), and exceptAll() that can be used to slice a DataFrame into two row-wise DataFrames.
Syntax
The following syntax is used in the examples −
limit()
This DataFrame method restricts the result to a given range of rows by specifying an integer value.
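For instance, a minimal sketch (the session name, the DataFrame df, and its values are purely illustrative and are reused by the short snippets below):

from pyspark.sql import SparkSession

# Illustrative session and DataFrame reused by the snippets that follow
spark = SparkSession.builder.appName("SYNTAX DEMO").getOrCreate()
df = spark.createDataFrame(
   [("A", 1), ("B", 2), ("C", 3), ("D", 4)], ["Letter", "Number"])

# limit(2) keeps only the first two rows
df.limit(2).show()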
subtract()
The subtract() method returns a new DataFrame containing the rows of one DataFrame that are not present in another DataFrame.
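A short sketch, reusing the illustrative df created above:

# Rows of df that are not in the first two-row slice
df_rest = df.subtract(df.limit(2))
df_rest.show()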
collect()
The PySpark collect() method retrieves all the elements of a DataFrame to the driver as a list of Row objects, which can then be processed with ordinary Python loops and variables.
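For example, assuming the illustrative df from above:

# Bring every row back to the driver as a Python list of Row objects
rows = df.collect()
print(rows)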
createDataFrame()
This SparkSession method builds a DataFrame from row data and a schema argument that defines the structure of the DataFrame.
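As a sketch, reusing the collected rows and the schema of the illustrative df:

# Rebuild a DataFrame from a list of Row objects and an existing schema
df_copy = spark.createDataFrame(rows, df.schema)
df_copy.show()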
[: before_slicing] [after_slicing :]
The above representation is Python list slicing; it is used here to divide the collected rows into two parts row-wise.
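Applied to the collected rows above, for example:

# Plain Python list slicing on the collected rows
rows_before = rows[:2]   # rows before index 2
rows_after = rows[2:]    # rows from index 2 onward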
head()
In general, the head() method returns the first 5 rows of a data table by default, but here it accepts an integer parameter and returns that many rows from the top of the DataFrame.
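A minimal sketch with the illustrative df:

# head(2) returns the first two rows as a list of Row objects
first_two = df.head(2)
print(first_two)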
exceptAll()
This PySpark DataFrame method returns a new DataFrame containing the rows of this DataFrame that are not present in the other DataFrame, while keeping the duplicates.
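For instance, again with the illustrative df:

# Rows of df that are not in the two-row slice, keeping any duplicates
df_slice = spark.createDataFrame(df.head(2), df.schema)
df.exceptAll(df_slice).show()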
count()
This DataFrame method returns the total number of rows as an integer.
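For example:

# Total number of rows in the illustrative df
total_rows = df.count()
print(total_rows)   # 4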
drop()
The drop() method eliminates the specified column from a DataFrame; in the example below it removes the helper row-number column.
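For example, with the illustrative df:

# Remove the Number column from the DataFrame
df.drop("Number").show()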
Window.orderBy()
A PySpark window function computes a result such as a row number or a rank over a set of rows. Window.orderBy() defines the ordering of rows within the window and is used here to number the rows before partitioning the data.
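A brief sketch, again assuming the illustrative df:

from pyspark.sql import Window
from pyspark.sql.functions import row_number

# Number the rows of df in the order of the Letter column
df.withColumn("row_number", row_number().over(Window.orderBy("Letter"))).show()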
Installation Requirement
pip install pyspark
This command installs the PySpark package that is required to run the PySpark programs below.
Using Limit() and Subtract() Methods
The limit() and subtract() methods are used to split a single DataFrame into two row-wise DataFrames. limit() selects a specific number of rows by assigning an integer value to it, whereas subtract() returns the rows that are not present in the other DataFrame.
Example
In the following example, we first import pyspark and the SparkSession module and create the Spark session. Then we set the row data in the variable rows and the column names in the variable cols. Next, we use the createDataFrame() method of the session to build the DataFrame df from rows and cols. The first slice, df_first, is obtained with limit(2), and the second slice, df_second, is obtained by calling subtract() with df_first as its parameter, which returns a new DataFrame containing the remaining rows. Finally, we call the show() method on both df_first and df_second to get the result.
# Import the PySpark module
import pyspark
from pyspark.sql import SparkSession

# Create the session
Spark_Session = SparkSession.builder.appName(
   'EMPLOYEE DATA'
).getOrCreate()

# Rows of the DataFrame
rows = [['1', 'RAHUL', 'INDIA', '1243'],
   ['2', 'PETER', 'SRI LANKA', '5461'],
   ['3', ' JOHN', 'SOUTH KOREA', '2224'],
   ['4', 'MARK', 'NEWYORK', '9985'],
   ['5', 'SUNNY', 'BANGLADESH', '8912']]

# Columns of the DataFrame
cols = ['S.N', 'EMPLOYEE NAME', 'COUNTRY', 'EMP_ID']

# DataFrame creation from rows and columns
df = Spark_Session.createDataFrame(rows, cols)

# First slice containing the first two rows
df_first = df.limit(2)

# Second slice containing the remaining rows
df_second = df.subtract(df_first)

# First slice with 2 rows and the column names
df_first.show()

# Second slice with 3 rows and the column names
df_second.show()
Output
+---+-------------+---------+------+
|S.N|EMPLOYEE NAME|  COUNTRY|EMP_ID|
+---+-------------+---------+------+
|  1|        RAHUL|    INDIA|  1243|
|  2|        PETER|SRI LANKA|  5461|
+---+-------------+---------+------+

+---+-------------+-----------+------+
|S.N|EMPLOYEE NAME|    COUNTRY|EMP_ID|
+---+-------------+-----------+------+
|  3|         JOHN|SOUTH KOREA|  2224|
|  5|        SUNNY| BANGLADESH|  8912|
|  4|         MARK|    NEWYORK|  9985|
+---+-------------+-----------+------+
Using Collect() and CreateDataFrame() Methods
The collect() method is used to retrieve all the rows from the given DataFrame, whereas createDataFrame() rebuilds the two slices into DataFrames that share the same schema.
Note that a schema defines the structure of a table.
Example
In the following example, we first create the session using SparkSession. Then we initialize the variable data with the row data as a list of tuples. Next, createDataFrame() is called with two parameters, data (the rows) and ["Name", "Age"] (the column names), and the result is stored in df. The collect() method is called on df to get the list of Row objects, which is stored in the variable rows. We then apply list slicing before and after index 2 to obtain rows1 and rows2 respectively. Moving ahead, createDataFrame() is called with two parameters, the sliced rows (rows1 and rows2) and df.schema (the schema of the original DataFrame), and the results are stored in df1 and df2 respectively. Finally, show() is called on both df1 and df2 to get the result.
from pyspark.sql import SparkSession

# Create the Spark session
spark = SparkSession.builder.appName("EMPLOYEE DATA").getOrCreate()

# Create the sample dataframe
data = [("Vivek", 31), ("Aman", 20), ("Sohan", 13), ("David", 24)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Getting the list of Row objects using the collect function
rows = df.collect()

# Getting two lists of rows by using slicing
rows1 = rows[:2]
rows2 = rows[2:]

# Convert the lists of Rows to PySpark DataFrames
df1 = spark.createDataFrame(rows1, df.schema)
df2 = spark.createDataFrame(rows2, df.schema)

# Result
df1.show()
df2.show()
Output
+-----+---+
| Name|Age|
+-----+---+
|Vivek| 31|
| Aman| 20|
+-----+---+

+-----+---+
| Name|Age|
+-----+---+
|Sohan| 13|
|David| 24|
+-----+---+
Using Count(), Filter(), and Drop() Methods
In this program, dividing the DataFrame into two row-wise DataFrames requires the count() and filter() methods. count() returns the total number of rows, whereas filter() selects the rows that belong to each of the two DataFrames. The drop() method then removes the helper column that was added to partition the DataFrame.
Example
In the following example, we first build the Spark session and set the row data in the variable named data. Then createDataFrame() is called with two parameters, data (the rows) and a list of column names, and the result is stored in df. df.count() stores the total number of rows in total_rows. Next, the number of rows for the first DataFrame is defined in n_rows_first_df. We then add a row-number column to the DataFrame using the built-in methods row_number(), over(), and Window.orderBy(). Now the DataFrame is partitioned into two row-wise DataFrames using filter(), and drop() removes the helper column from each. Finally, show() is called on the two variables to get the result as two row-wise DataFrames.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Create the original DataFrame
data = [("Rabina", 35), ("Stephen", 31), ("Raman", 33), ("Salman", 44), ("Meera", 37)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Get the total number of rows
total_rows = df.count()

# Define the number of rows for the first DataFrame
n_rows_first_df = 2

# Add a row number column to the DataFrame
df_with_row_number = df.withColumn("row_number", row_number().over(Window.orderBy("Name")))

# Slice the DataFrame into two using filter()
first_df = df_with_row_number.filter(df_with_row_number.row_number <= n_rows_first_df).drop("row_number")
second_df = df_with_row_number.filter(df_with_row_number.row_number > n_rows_first_df).drop("row_number")

# Show the resulting DataFrames
first_df.show()
second_df.show()
Output
+------+---+
|  Name|Age|
+------+---+
| Meera| 37|
|Rabina| 35|
+------+---+

+-------+---+
|   Name|Age|
+-------+---+
|  Raman| 33|
| Salman| 44|
|Stephen| 31|
+-------+---+
Using Head() and ExceptAll() Methods
To divide the DataFrame into two row-wise DataFrames, this approach uses the head() and exceptAll() methods, which separate the DataFrame into two parts with distinct rows.
Example
In the following example, the built-in method count() is used to get the total number of rows. The number of rows for the first DataFrame is then assigned to the variable n_rows_first_df. To create the two DataFrames it uses three different built-in functions, head(), createDataFrame(), and exceptAll(), storing the results in their respective variables. In the end, two show() calls display the two row-wise DataFrames.
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Create the original DataFrame
data = [("karisma", 25), ("Bobby", 30), ("Champak", 35), ("Mark", 40)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Get the total number of rows
total_rows = df.count()

# Define the number of rows for the first DataFrame
n_rows_first_df = 2

# Slice the DataFrame into two using head() and exceptAll()
first_rows = df.head(n_rows_first_df)
first_df = spark.createDataFrame(first_rows, df.schema)
second_df = df.exceptAll(first_df)

# Show the resulting DataFrames
first_df.show()
second_df.show()
Output
+-------+---+
|   Name|Age|
+-------+---+
|karisma| 25|
|  Bobby| 30|
+-------+---+

+-------+---+
|   Name|Age|
+-------+---+
|Champak| 35|
|   Mark| 40|
+-------+---+
Conclusion
We discussed four distinct methods to slice a PySpark DataFrame into two row-wise DataFrames, each of which partitions the DataFrame in its own way. The PySpark DataFrame is a high-level, distributed data structure commonly used by data engineers and data scientists, for example through the Python API for Spark and its machine learning libraries.