Running Apache Spark Applications in Docker Containers

Even once your Spark cluster is configured and ready, you still have a lot of work to do before you can run an application on it. Running it from a Docker container can make that much easier!

By Arseniy Tashoyan · Aug. 26, 17 · Tutorial

Apache Spark is a wonderful tool for distributed computations. However, some preparation steps are required on the machine where the application will be running. Assuming that you already have your Spark cluster configured and ready, you still have to do the following steps on your workstation:

  • Install the Apache Spark distribution, which contains the necessary tools and libraries.

  • Install the Java Development Kit.

  • Install and configure an SCM client, such as Git.

  • Install and configure a build tool, such as SBT.

Then, you have to check out the source code from the repository, build the binary, and submit it to the Spark cluster using the special spark-submit tool. It should be clear by now that one cannot simply run an Apache Spark application... right?

Wrong! If you have the URL of the application source code and the URL of the Spark cluster, then you can just run the application.
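
For contrast, here is roughly what the manual route boils down to. This is only a sketch: the repository and master URLs are the same placeholders used in the example below, and the JAR path depends entirely on your build, so treat it as hypothetical.

# Check out and build the application (an SBT project is assumed)
git clone https://github.com/mylogin/project.git
cd project
sbt package

# Submit the resulting JAR to the cluster with the spark-submit tool
# shipped in the Spark distribution (the JAR path is a placeholder)
spark-submit \
  --master spark://my.master.com:7077 \
  --class Main \
  target/scala-2.11/project_2.11-0.1.jar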

Let’s confine the complex things to a Docker container: docker-spark-submit. This Docker image serves as a bridge between the source code and the runtime environment, covering all the intermediate steps.

Running applications in containers provides the following benefits:

  • Zero configuration on the machine because the container has everything it needs.

  • Clean application environment thanks to container immutability.

Here is an example of typical usage:

docker run \
  -ti \
  --rm \
  -p 5000-5010:5000-5010 \
  -e SCM_URL="https://github.com/mylogin/project.git" \
  -e SPARK_MASTER="spark://my.master.com:7077" \
  -e SPARK_DRIVER_HOST="host.domain" \
  -e MAIN_CLASS="Main" \
  tashoyan/docker-spark-submit:spark-2.2.0

The SCM_URL, SPARK_MASTER, and MAIN_CLASS parameters are self-describing. The other parameters are less intuitive, but important:

tashoyan/docker-spark-submit:spark-2.2.0

Choose the tag of the container image based on the version of your Spark cluster. In this example, Spark 2.2.0 is assumed.
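
If you are not sure which version your cluster runs, you can check it on any cluster node; a quick sketch, using the same placeholder master host as in the example above:

# Print the version of the Spark distribution installed on a cluster node
spark-submit --version

# Alternatively, the standalone master web UI (default port 8080) shows
# the Spark version in its header:
#   http://my.master.com:8080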

-p 5000-5010:5000-5010

It is necessary to publish this range of network ports. The Spark driver program and Spark executors use these ports for communication.
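
Under the hood, these are standard Spark settings: the properties below determine which ports the driver process listens on. The concrete values baked into the image may differ, so take this only as an illustration of why the 5000-5010 range has to be published (spark-defaults.conf syntax):

# Illustrative values only -- not necessarily the ones the image uses
spark.driver.port        5000
spark.blockManager.port  5001
spark.port.maxRetries    9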

-e SPARK_DRIVER_HOST="host.domain"

You have to specify the network address of the host machine where the container will run. Spark cluster nodes must be able to resolve this address; it is necessary for communication between the executors and the driver program. For detailed technical information, see SPARK-4563.
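
In practice, this means finding a name or IP address of your workstation that is reachable from the cluster nodes. A small sketch of how you might look one up (pick whichever actually resolves in your network):

# Candidate values for SPARK_DRIVER_HOST
hostname -f     # fully qualified host name of the workstation
hostname -I     # IP addresses assigned to the workstation (Linux only)

You would then pass the chosen value as -e SPARK_DRIVER_HOST="..." in the docker run command above.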

Detailed instructions, as well as some examples, are available on the project page on GitHub. There you can find:

  • How to run the application code from a custom Git branch or from a custom subdirectory.

  • How to supply data for your Spark application by means of Docker volumes (see the sketch after this list).

  • How to provide custom Spark settings or application arguments.

  • How to run docker-spark-submit on a machine behind a proxy server.
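
As an illustration of the Docker volumes item above: supplying input data is plain Docker. You mount a host directory into the container and refer to the in-container path from your application. The mount itself is standard Docker; how that path then reaches your application (for example, as a program argument) is described on the project page, and both paths here are placeholders.

# Mount a host directory with input data into the container (read-only)
docker run \
  -ti \
  --rm \
  -p 5000-5010:5000-5010 \
  -v /home/me/dataset:/data:ro \
  -e SCM_URL="https://github.com/mylogin/project.git" \
  -e SPARK_MASTER="spark://my.master.com:7077" \
  -e SPARK_DRIVER_HOST="host.domain" \
  -e MAIN_CLASS="Main" \
  tashoyan/docker-spark-submit:spark-2.2.0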

To conclude, let me emphasize that docker-spark-submit is not intended for continuous integration. The intended usage is to let people quickly try Spark applications, saving them from configuration overhead. CI practices assume separate stages for building, testing, and deploying;  docker-spark-submit does not follow these practices.


Published at DZone with permission of Arseniy Tashoyan. See the original article here.

Opinions expressed by DZone contributors are their own.
