How Does Spark Use MapReduce?

Apache Spark does use MapReduce — but only the idea of it, not the exact implementation. Confused? Let's talk about an example.

By Anubhav Tarar · Jan. 04, 18 · Opinion

In this article, we will talk about an interesting question: does Spark use MapReduce or not? The answer is yes, but Spark borrows only the idea, not the exact implementation. Let's walk through an example. To read a text file in Spark, all we do is:

spark.sparkContext.textFile("fileName")
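
As a quick illustration, here is a minimal, self-contained sketch of textFile in use. It assumes a local SparkSession; the local[*] master and the path data/sample.txt are just placeholders:

import org.apache.spark.sql.SparkSession

object TextFileExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TextFileExample")
      .master("local[*]") // local mode, just for this sketch
      .getOrCreate()

    // textFile returns an RDD[String], one element per line
    val lines = spark.sparkContext.textFile("data/sample.txt")
    println(s"Line count: ${lines.count()}")

    spark.stop()
  }
}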

But do you know how it actually works? Try Ctrl + clicking on the textFile method. You will find this code:

/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

As you can see, textFile delegates to the hadoopFile method, which goes through the Hadoop API. It passes the file path, the input format class (TextInputFormat), the key class (LongWritable), the value class (Text), and the minimum number of partitions.

So it doesn't matter whether you are reading a text file from the local file system, S3, or HDFS: Spark always uses the Hadoop API to read it.

hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  minPartitions).map(pair => pair._2.toString).setName(path)

Can you see what this code is doing? The first parameter is the path of the file, the second is the input format used to read it, and the third and fourth are the key and value types produced by Hadoop's record reader: LongWritable is the byte offset at which each line starts, and Text is the line itself.
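
To actually see those key-value pairs, here is a sketch that calls hadoopFile directly, just as textFile does internally, but keeps both the key (the byte offset) and the value (the line). It assumes an existing SparkContext named sc and the same placeholder path as before:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// The same call textFile makes, but without dropping the keys
val pairs = sc.hadoopFile(
  "data/sample.txt",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])

// Each element is (byte offset of the line, the line itself)
pairs.map { case (offset, line) => s"$offset -> $line" }
  .take(5)
  .foreach(println)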

Now, you might be thinking, "Why don't we get this offset back when we read a file?" The reason is the map call below:

.map(pair => pair._2.toString)

This maps over all the key-value pairs but keeps only the values (the lines themselves), discarding the offsets.
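
Continuing the sketch above (with the same pairs RDD), applying that map yourself reproduces what textFile returns: an RDD[String] of lines, with the offsets dropped:

// Keeping only the values gives the same result as textFile
val linesOnly = pairs.map(pair => pair._2.toString)
linesOnly.take(5).foreach(println)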

Tags: Hadoop, MapReduce, file IO

Published at DZone with permission of Anubhav Tarar, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
