How Does Spark Use MapReduce?

Apache Spark does use MapReduce — but only the idea of it, not the exact implementation. Confused? Let's talk about an example.

By Anubhav Tarar · Jan. 04, 18 · Opinion

In this article, we will talk about an interesting question: does Spark use MapReduce or not? The answer is yes, but Spark borrows only the idea, not the exact implementation. Let's walk through an example. To read a text file in Spark, all we do is:

spark.sparkContext.textFile("fileName")
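
As a quick illustration, here is a minimal, self-contained sketch of textFile in use. It assumes a local SparkSession; the local[*] master and the path data/sample.txt are just placeholders:

import org.apache.spark.sql.SparkSession

object TextFileExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TextFileExample")
      .master("local[*]") // local mode, just for this sketch
      .getOrCreate()

    // textFile returns an RDD[String], one element per line
    val lines = spark.sparkContext.textFile("data/sample.txt")
    println(s"Line count: ${lines.count()}")

    spark.stop()
  }
}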

But do you know how it actually works? Try Ctrl + clicking on the textFile method. You will find this code:

/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

As you can see, textFile delegates to the hadoopFile method, which goes through the Hadoop API. It passes the file path, the input format class (TextInputFormat), the key class (LongWritable), the value class (Text), and the minimum number of partitions.

So it doesn't matter whether you are reading a text file from the local file system, S3, or HDFS: Spark always uses the Hadoop API to read it.

hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  minPartitions).map(pair => pair._2.toString).setName(path)

Can you see what this code is doing? The first parameter is the path of the file, the second is the input format used to read it, and the third and fourth are the key and value types produced by Hadoop's record reader: LongWritable is the byte offset at which each line starts, and Text is the line itself.
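
To actually see those key-value pairs, here is a sketch that calls hadoopFile directly, just as textFile does internally, but keeps both the key (the byte offset) and the value (the line). It assumes an existing SparkContext named sc and the same placeholder path as before:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// The same call textFile makes, but without dropping the keys
val pairs = sc.hadoopFile(
  "data/sample.txt",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])

// Each element is (byte offset of the line, the line itself)
pairs.map { case (offset, line) => s"$offset -> $line" }
  .take(5)
  .foreach(println)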

Now, you might be thinking, "Why don't we get this offset back when we read a file?" The reason is the map call below:

.map(pair => pair._2.toString)

This maps over all the key-value pairs but keeps only the values (the lines themselves), discarding the offsets.
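
Continuing the sketch above (with the same pairs RDD), applying that map yourself reproduces what textFile returns: an RDD[String] of lines, with the offsets dropped:

// Keeping only the values gives the same result as textFile
val linesOnly = pairs.map(pair => pair._2.toString)
linesOnly.take(5).foreach(println)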

Tags: Hadoop, MapReduce, file IO

Published at DZone with permission of Anubhav Tarar, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
