Create PySpark DataFrame from Multiple Lists
PySpark is a powerful tool for processing large datasets in a distributed computing environment. One of the fundamental tasks in data analysis is to convert data into a format that can be easily processed and analysed. In PySpark, data is typically stored in a DataFrame, which is a distributed collection of data organised into named columns.
In some cases, we may want to create a PySpark DataFrame from multiple lists. This can be useful when we have data in a format that is not easily loaded from a file or database. For example, we may have data stored in Python lists or NumPy arrays that we want to convert to a PySpark DataFrame for further analysis.
In this article, we will explore how to create a PySpark DataFrame from multiple lists. We will discuss different approaches and provide code examples with comments and outputs for each approach.
Convert lists to a NumPy array and then to a PySpark DataFrame
One approach to creating a PySpark DataFrame from multiple lists is to first convert the lists to a NumPy array and then create the DataFrame from that array using the createDataFrame() function. This approach uses the pyspark.sql.types module to specify the schema of the DataFrame.
Consider the code shown below.
Example
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

# Initialize the SparkSession
spark = SparkSession.builder.appName("Create DataFrame from Lists").getOrCreate()

# Define the lists
age = [20, 25, 30, 35, 40]
salary = [25000, 35000, 45000, 55000, 65000]

# Convert the lists to a NumPy array and transpose it so each row is one record
data = np.array([age, salary]).T

# Define the schema
schema = StructType([
    StructField("age", IntegerType(), True),
    StructField("salary", IntegerType(), True)
])

# Create the PySpark DataFrame from the array converted back to a list of lists
df = spark.createDataFrame(data.tolist(), schema=schema)

# Show the DataFrame
df.show()
Explanation
First, we import the required modules (numpy, SparkSession, and the pyspark.sql.types classes) and initialize a SparkSession.
Next, we define the two lists: age and salary.
We then convert the lists to a NumPy array using the np.array() function and transpose the array using .T.
After that, we define the schema of the DataFrame using the StructType() and StructField() functions. In this case, we define two columns, age and salary, both with the IntegerType() data type.
Finally, we create the PySpark DataFrame using the createDataFrame() function and passing the NumPy array converted to a list and the schema as parameters. We then show the DataFrame using the show() function.
Output
+---+------+
|age|salary|
+---+------+
| 20| 25000|
| 25| 35000|
| 30| 45000|
| 35| 55000|
| 40| 65000|
+---+------+
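To see why the transpose is needed, here is a minimal standalone sketch (no Spark required) showing how .T turns the two stacked lists into one [age, salary] pair per row:

import numpy as np

age = [20, 25, 30, 35, 40]
salary = [25000, 35000, 45000, 55000, 65000]

# np.array([age, salary]) has shape (2, 5): one full list per row
# Transposing with .T gives shape (5, 2): one [age, salary] pair per row
data = np.array([age, salary]).T
print(data.shape)     # (5, 2)
print(data.tolist())  # [[20, 25000], [25, 35000], [30, 45000], [35, 55000], [40, 65000]]

Without the transpose, createDataFrame() would receive two rows of five values each, which would not match the two-column schema.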
Using PySpark's createDataFrame() method
In this approach, we create a PySpark DataFrame directly from the lists using the createDataFrame() method provided by PySpark. We first build a list of tuples, where each tuple represents a row of the DataFrame. Then, we define a schema that describes the structure of the DataFrame, i.e., the column names and data types. Finally, we create the DataFrame by passing the list of tuples and the schema to the createDataFrame() method.
Consider the code shown below.
Example
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("Create DataFrame from Lists").getOrCreate()

# Define the data as lists
names = ["Alice", "Bob", "Charlie", "David"]
ages = [25, 30, 35, 40]
genders = ["Female", "Male", "Male", "Male"]

# Define the schema of the dataframe
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Gender", StringType(), True)
])

# Create a list of tuples, one tuple per row
data = [(names[i], ages[i], genders[i]) for i in range(len(names))]

# Create a PySpark dataframe
df = spark.createDataFrame(data, schema)

# Show the dataframe
df.show()
Explanation
First, we import the required modules (SparkSession and the pyspark.sql.types classes) and initialize a SparkSession.
Next, we define three lists: names, ages, and genders.
After that, we define the schema of the DataFrame using the StructType() and StructField() functions, with Name and Gender as StringType() columns and Age as an IntegerType() column.
We then combine the three lists into a list of tuples using a list comprehension, where each tuple represents one row of the DataFrame.
Finally, we create the PySpark DataFrame using the createDataFrame() function, passing the list of tuples and the schema as parameters. We then show the DataFrame using the show() function.
Output
+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|  Alice| 25|Female|
|    Bob| 30|  Male|
|Charlie| 35|  Male|
|  David| 40|  Male|
+-------+---+------+
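As a side note, when an explicit schema is not required, the same lists can be paired with Python's built-in zip() function and the column names passed as a plain list, letting Spark infer the data types. This is a minimal sketch reusing the names, ages, and genders lists and the spark session from the example above; the variable name df2 is just for illustration:

# Pair the lists element-wise; list() is needed because zip() returns an iterator
rows = list(zip(names, ages, genders))

# Pass the column names as a list and let Spark infer the types
df2 = spark.createDataFrame(rows, ["Name", "Age", "Gender"])
df2.show()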
Conclusion
In this article, we explored two different approaches to creating a PySpark DataFrame from multiple lists. The first approach converted the lists to a NumPy array, transposed it so that each row holds one record, and passed the result to the createDataFrame() method along with a schema. The second approach combined the lists into a list of tuples and created the DataFrame by passing the tuples and a schema built with the StructType() and StructField() functions to the createDataFrame() method.