
Slice a PySpark DataFrame into Two Row-wise DataFrames
A PySpark DataFrame is a collection of distributed data that can be processed across different machines and is organized into named columns. The term slice is normally used to represent the partitioning of data. PySpark provides methods such as limit(), collect(), and exceptAll() that can be used to slice a DataFrame into two row-wise DataFrames.
Syntax
The following syntax is used in the examples −
limit()
This DataFrame method restricts the result to a given range of rows by specifying an integer value.
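For instance, a minimal sketch (the session name, the DataFrame df, and its values are purely illustrative and are reused by the short snippets below):

from pyspark.sql import SparkSession

# Illustrative session and DataFrame reused by the snippets that follow
spark = SparkSession.builder.appName("SYNTAX DEMO").getOrCreate()
df = spark.createDataFrame(
   [("A", 1), ("B", 2), ("C", 3), ("D", 4)], ["Letter", "Number"])

# limit(2) keeps only the first two rows
df.limit(2).show()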
subtract()
The subtract() method returns a new DataFrame containing the rows of one DataFrame that are not present in another DataFrame.
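A short sketch, reusing the illustrative df created above:

# Rows of df that are not in the first two-row slice
df_rest = df.subtract(df.limit(2))
df_rest.show()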
collect()
The PySpark collect() method retrieves all the elements of a DataFrame to the driver as a list of Row objects, which can then be processed with ordinary Python loops and variables.
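For example, assuming the illustrative df from above:

# Bring every row back to the driver as a Python list of Row objects
rows = df.collect()
print(rows)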
createDataFrame()
This SparkSession method builds a DataFrame from row data and a schema argument that defines the structure of the DataFrame.
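As a sketch, reusing the collected rows and the schema of the illustrative df:

# Rebuild a DataFrame from a list of Row objects and an existing schema
df_copy = spark.createDataFrame(rows, df.schema)
df_copy.show()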
[: before_slicing] [after_slicing :]
The above representation is Python list slicing; it is used here to divide the collected rows into two parts row-wise.
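Applied to the collected rows above, for example:

# Plain Python list slicing on the collected rows
rows_before = rows[:2]   # rows before index 2
rows_after = rows[2:]    # rows from index 2 onward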
head()
In general, the head() method returns the first 5 rows of a data table by default, but here it accepts an integer parameter and returns that many rows from the top of the DataFrame.
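A minimal sketch with the illustrative df:

# head(2) returns the first two rows as a list of Row objects
first_two = df.head(2)
print(first_two)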
exceptAll()
This PySpark DataFrame method returns a new DataFrame containing the rows of this DataFrame that are not present in the other DataFrame, while keeping the duplicates.
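For instance, again with the illustrative df:

# Rows of df that are not in the two-row slice, keeping any duplicates
df_slice = spark.createDataFrame(df.head(2), df.schema)
df.exceptAll(df_slice).show()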
count()
This DataFrame method returns the total number of rows as an integer.
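For example:

# Total number of rows in the illustrative df
total_rows = df.count()
print(total_rows)   # 4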
drop()
The drop() method eliminates the specified column from a DataFrame; in the example below it removes the helper row-number column.
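For example, with the illustrative df:

# Remove the Number column from the DataFrame
df.drop("Number").show()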
Window.orderBy()
A PySpark window function computes a result such as a row number or a rank over a set of rows. Window.orderBy() defines the ordering of rows within the window and is used here to number the rows before partitioning the data.
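A brief sketch, again assuming the illustrative df:

from pyspark.sql import Window
from pyspark.sql.functions import row_number

# Number the rows of df in the order of the Letter column
df.withColumn("row_number", row_number().over(Window.orderBy("Letter"))).show()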
Installation Requirement
pip install pyspark
This command installs the PySpark package that is required to run the PySpark programs below.
Using Limit() and Subtract() Methods
The limit() and subtract() methods are used to split a single DataFrame into two row-wise DataFrames. limit() selects a specific number of rows by assigning an integer value to it, whereas subtract() returns the rows that are not present in the other DataFrame.
Example
In the following example, we first import pyspark and the SparkSession module and create the Spark session. Then we set the row data in the variable rows and the column names in the variable cols. Next, we use the createDataFrame() method of the session to build the DataFrame df from rows and cols. The first slice, df_first, is obtained with limit(2), and the second slice, df_second, is obtained by calling subtract() with df_first as its parameter, which returns a new DataFrame containing the remaining rows. Finally, we call the show() method on both df_first and df_second to get the result.
# Import the PySpark module
import pyspark
from pyspark.sql import SparkSession

# Create the session
Spark_Session = SparkSession.builder.appName(
   'EMPLOYEE DATA'
).getOrCreate()

# Rows of the DataFrame
rows = [['1', 'RAHUL', 'INDIA', '1243'],
   ['2', 'PETER', 'SRI LANKA', '5461'],
   ['3', ' JOHN', 'SOUTH KOREA', '2224'],
   ['4', 'MARK', 'NEWYORK', '9985'],
   ['5', 'SUNNY', 'BANGLADESH', '8912']]

# Columns of the DataFrame
cols = ['S.N', 'EMPLOYEE NAME', 'COUNTRY', 'EMP_ID']

# DataFrame creation from rows and columns
df = Spark_Session.createDataFrame(rows, cols)

# First slice containing the first two rows
df_first = df.limit(2)

# Second slice containing the remaining rows
df_second = df.subtract(df_first)

# First slice with 2 rows and the column names
df_first.show()

# Second slice with 3 rows and the column names
df_second.show()
Output
+---+-------------+---------+------+
|S.N|EMPLOYEE NAME|  COUNTRY|EMP_ID|
+---+-------------+---------+------+
|  1|        RAHUL|    INDIA|  1243|
|  2|        PETER|SRI LANKA|  5461|
+---+-------------+---------+------+

+---+-------------+-----------+------+
|S.N|EMPLOYEE NAME|    COUNTRY|EMP_ID|
+---+-------------+-----------+------+
|  3|         JOHN|SOUTH KOREA|  2224|
|  5|        SUNNY| BANGLADESH|  8912|
|  4|         MARK|    NEWYORK|  9985|
+---+-------------+-----------+------+
Using Collect() and CreateDataFrame() Methods
The collect() method is used to retrieve all the rows from the given DataFrame, whereas createDataFrame() rebuilds the two slices into DataFrames that share the same schema.
Note that a schema defines the structure of a table.
Example
In the following example, we first create the session using SparkSession. Then we initialize the variable data with the row data as a list of tuples. Next, createDataFrame() is called with two parameters, data (the rows) and ["Name", "Age"] (the column names), and the result is stored in df. The collect() method is called on df to get the list of Row objects, which is stored in the variable rows. We then apply list slicing before and after index 2 to obtain rows1 and rows2 respectively. Moving ahead, createDataFrame() is called with two parameters, the sliced rows (rows1 and rows2) and df.schema (the schema of the original DataFrame), and the results are stored in df1 and df2 respectively. Finally, show() is called on both df1 and df2 to get the result.
from pyspark.sql import SparkSession

# Create the Spark session
spark = SparkSession.builder.appName("EMPLOYEE DATA").getOrCreate()

# Create the sample dataframe
data = [("Vivek", 31), ("Aman", 20), ("Sohan", 13), ("David", 24)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Getting the list of Row objects using the collect function
rows = df.collect()

# Getting two lists of rows by using slicing
rows1 = rows[:2]
rows2 = rows[2:]

# Convert the lists of Rows to PySpark DataFrames
df1 = spark.createDataFrame(rows1, df.schema)
df2 = spark.createDataFrame(rows2, df.schema)

# Result
df1.show()
df2.show()
Output
+-----+---+
| Name|Age|
+-----+---+
|Vivek| 31|
| Aman| 20|
+-----+---+

+-----+---+
| Name|Age|
+-----+---+
|Sohan| 13|
|David| 24|
+-----+---+
Using Count(), Filter(), and Drop() Methods
In this program, dividing the DataFrame into two row-wise DataFrames requires the count() and filter() methods. count() returns the total number of rows, whereas filter() selects the rows that belong to each of the two DataFrames. The drop() method then removes the helper column that was added to partition the DataFrame.
Example
In the following example, we first build the Spark session and set the row data in the variable named data. Then createDataFrame() is called with two parameters, data (the rows) and a list of column names, and the result is stored in df. df.count() stores the total number of rows in total_rows. Next, the number of rows for the first DataFrame is defined in n_rows_first_df. We then add a row-number column to the DataFrame using the built-in methods row_number(), over(), and Window.orderBy(). Now the DataFrame is partitioned into two row-wise DataFrames using filter(), and drop() removes the helper column from each. Finally, show() is called on the two variables to get the result as two row-wise DataFrames.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Create the original DataFrame
data = [("Rabina", 35), ("Stephen", 31), ("Raman", 33), ("Salman", 44), ("Meera", 37)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Get the total number of rows
total_rows = df.count()

# Define the number of rows for the first DataFrame
n_rows_first_df = 2

# Add a row number column to the DataFrame
df_with_row_number = df.withColumn("row_number", row_number().over(Window.orderBy("Name")))

# Slice the DataFrame into two using filter()
first_df = df_with_row_number.filter(df_with_row_number.row_number <= n_rows_first_df).drop("row_number")
second_df = df_with_row_number.filter(df_with_row_number.row_number > n_rows_first_df).drop("row_number")

# Show the resulting DataFrames
first_df.show()
second_df.show()
Output
+------+---+
|  Name|Age|
+------+---+
| Meera| 37|
|Rabina| 35|
+------+---+

+-------+---+
|   Name|Age|
+-------+---+
|  Raman| 33|
| Salman| 44|
|Stephen| 31|
+-------+---+
Using Head() and ExceptAll() Methods
To divide the DataFrame into two row-wise DataFrames, this approach uses the head() and exceptAll() methods, which separate the DataFrame into two parts with distinct rows.
Example
In the following example, the built-in method count() is used to get the total number of rows. The number of rows for the first DataFrame is then assigned to the variable n_rows_first_df. To create the two DataFrames it uses three different built-in functions, head(), createDataFrame(), and exceptAll(), storing the results in their respective variables. In the end, two show() calls display the two row-wise DataFrames.
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Create the original DataFrame
data = [("karisma", 25), ("Bobby", 30), ("Champak", 35), ("Mark", 40)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Get the total number of rows
total_rows = df.count()

# Define the number of rows for the first DataFrame
n_rows_first_df = 2

# Slice the DataFrame into two using head() and exceptAll()
first_rows = df.head(n_rows_first_df)
first_df = spark.createDataFrame(first_rows, df.schema)
second_df = df.exceptAll(first_df)

# Show the resulting DataFrames
first_df.show()
second_df.show()
Output
+-------+---+
|   Name|Age|
+-------+---+
|karisma| 25|
|  Bobby| 30|
+-------+---+

+-------+---+
|   Name|Age|
+-------+---+
|Champak| 35|
|   Mark| 40|
+-------+---+
Conclusion
We discussed four distinct methods to slice a PySpark DataFrame into two row-wise DataFrames, each of which partitions the DataFrame in its own way. The PySpark DataFrame is a high-level, distributed data structure commonly used by data engineers and data scientists, for example through the Python API for Spark and its machine learning libraries.