E-MapReduce: Use spark-submit to submit a Spark job

Last Updated: Apr 17, 2025

This topic describes how to use spark-submit to submit a Spark job after E-MapReduce (EMR) Serverless Spark is connected to Elastic Compute Service (ECS).

Prerequisites

  • Java Development Kit (JDK) V1.8 or later is installed.

  • If you want to use a RAM user to submit Spark jobs, make sure that the RAM user is added to a Serverless Spark workspace as a member and assigned the developer role or a role that has higher permissions. For more information, see Manage users and roles.

Procedure

Step 1: Download and install the spark-submit tool for EMR Serverless

  1. Click emr-serverless-spark-tool-0.4.0-bin.zip to download the installation package.

  2. Upload the installation package to an ECS instance. For more information, see Upload and download files.

  3. Run the following command to decompress the installation package and install the spark-submit tool:

    unzip emr-serverless-spark-tool-0.4.0-bin.zip

Step 2: Configure parameters

Important

If the SPARK_CONF_DIR environment variable is configured in the environment in which the spark-submit tool is installed, you must store the configuration file in the directory specified by that variable. Otherwise, an error is reported. For EMR clusters, this directory is /etc/taihao-apps/spark-conf in most cases.
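
For example, you can check whether the variable is set and place connection.properties accordingly. The following commands are a minimal sketch; adjust the paths to your environment:

    # Check whether SPARK_CONF_DIR is set in the current environment.
    echo "${SPARK_CONF_DIR:-not set}"

    # If it is set, copy the configuration file into that directory after you edit it.
    if [ -n "${SPARK_CONF_DIR}" ]; then
        cp emr-serverless-spark-tool-0.4.0/conf/connection.properties "${SPARK_CONF_DIR}/"
    fi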

  1. Run the following command to modify the configuration of the connection.properties file:

    vim emr-serverless-spark-tool-0.4.0/conf/connection.properties
  2. Configure parameters in the file based on the following sample code. The parameters are specified in the key=value format.

    accessKeyId=yourAccessKeyId
    accessKeySecret=yourAccessKeySecret
    regionId=cn-hangzhou
    endpoint=emr-serverless-spark.cn-hangzhou.aliyuncs.com
    workspaceId=w-xxxxxxxxxxxx
    resourceQueueId=dev_queue
    releaseVersion=esr-2.2 (Spark 3.3.1, Scala 2.12, Java Runtime)

    The following list describes the parameters.

    • accessKeyId (Required): The AccessKey ID of the Alibaba Cloud account or RAM user that is used to run the Spark job.

    • accessKeySecret (Required): The AccessKey secret of the Alibaba Cloud account or RAM user that is used to run the Spark job.

    • regionId (Required): The region ID. In this example, the China (Hangzhou) region is used.

    • endpoint (Required): The endpoint of EMR Serverless Spark. For more information, see Endpoints. In this example, the public endpoint in the China (Hangzhou) region, emr-serverless-spark.cn-hangzhou.aliyuncs.com, is used.

      Note: If the ECS instance cannot access the Internet, you must use the virtual private cloud (VPC) endpoint of EMR Serverless Spark.

    • workspaceId (Required): The ID of the EMR Serverless Spark workspace.

    • resourceQueueId (Optional): The name of the queue. Default value: dev_queue.

      Note: If you use a RAM user, make sure that the RAM user is granted the required permissions on the destination queue. Otherwise, the error "dev_queue is not valid" may be reported. For information about how to grant the required permissions on a queue, see Manage permissions on a resource queue.

    • releaseVersion (Optional): The version of EMR Serverless Spark. Example: esr-4.1.1 (Spark 3.5.2, Scala 2.12, Java Runtime).

Step 3: Submit a Spark job

  1. Run the following command to go to the directory of the spark-submit tool:

    cd emr-serverless-spark-tool-0.4.0
  2. Select a job submission method based on the job type.

    Use spark-submit

    spark-submit is a general-purpose job submission tool provided by Spark. You can use it to submit jobs written in Java, Scala, or Python (PySpark).

    Spark job launched from Java or Scala

    In this example, the test JAR package spark-examples_2.12-3.3.1.jar is used. Click spark-examples_2.12-3.3.1.jar to download the test JAR package, and then upload it to Object Storage Service (OSS). The JAR package is a simple example provided by Spark that calculates the value of pi (π).

    ./bin/spark-submit --name SparkPi \
    --queue dev_queue \
    --num-executors 5 \
    --driver-memory 1g \
    --executor-cores 2 \
    --executor-memory 2g \
    --class org.apache.spark.examples.SparkPi \
    oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
    10000

    Spark job launched from PySpark

    In this example, the test files DataFrame.py and employee.csv are used. Click DataFrame.py and employee.csv to download the test files, and then upload them to OSS.

    Note
    • The DataFrame.py file contains the code that is used to process data in Object Storage Service (OSS) under the Apache Spark framework.

    • The employee.csv file contains data such as employee names, departments, and salaries.

    ./bin/spark-submit --name PySpark \
    --queue dev_queue \
    --num-executors 5 \
    --driver-memory 1g \
    --executor-cores 2 \
    --executor-memory 2g \
    --conf spark.tags.key=value \
    oss://<yourBucket>/path/to/DataFrame.py \
    oss://<yourBucket>/path/to/employee.csv
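
    If the PySpark job depends on additional Python modules, you can attach them with the --py-files parameter described in the parameter list below. The following command is only a sketch; utils.py is a hypothetical helper module that you would upload to OSS yourself:

    # --py-files attaches additional Python files; utils.py is a hypothetical example.
    ./bin/spark-submit --name PySpark \
    --queue dev_queue \
    --num-executors 5 \
    --driver-memory 1g \
    --executor-cores 2 \
    --executor-memory 2g \
    --py-files oss://<yourBucket>/path/to/utils.py \
    oss://<yourBucket>/path/to/DataFrame.py \
    oss://<yourBucket>/path/to/employee.csv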

    The following lists describe the parameters.

    • Supported open source parameters

      • --name (example: SparkPi): The application name of the Spark job, which is used to identify the job.

      • --class (example: org.apache.spark.examples.SparkPi): The entry class of the Spark job. This parameter is required only if the Spark job is launched from Java or Scala.

      • --num-executors (example: 10): The number of executors of the Spark job.

      • --driver-cores (example: 1): The number of driver cores of the Spark job.

      • --driver-memory (example: 4g): The size of the driver memory of the Spark job.

      • --executor-cores (example: 1): The number of executor cores of the Spark job.

      • --executor-memory (example: 1024m): The size of the executor memory of the Spark job.

      • --files (example: oss://<yourBucket>/file1,oss://<yourBucket>/file2): The resource files used by the Spark job. Only resource files stored in OSS are supported. Separate multiple files with commas (,).

      • --py-files (example: oss://<yourBucket>/file1.py,oss://<yourBucket>/file2.py): The Python scripts used by the Spark job. Only Python scripts stored in OSS are supported. Separate multiple scripts with commas (,). This parameter applies only if the Spark job is launched from PySpark.

      • --jars (example: oss://<yourBucket>/file1.jar,oss://<yourBucket>/file2.jar): The JAR packages used by the Spark job. Only JAR packages stored in OSS are supported. Separate multiple packages with commas (,).

      • --archives (example: oss://<yourBucket>/archive.tar.gz#env,oss://<yourBucket>/archive2.zip): The archive packages used by the Spark job. Only archive packages stored in OSS are supported. Separate multiple packages with commas (,).

      • --queue (example: root_queue): The name of the queue on which the Spark job runs. The queue name must be the same as a queue name in the EMR Serverless Spark workspace.

      • --proxy-user (example: test): Overrides the value of the HADOOP_USER_NAME environment variable. The parameter functions the same way as the corresponding open source parameter.

      • --conf (example: spark.tags.key=value): A custom configuration of the Spark job, in key=value format.

      • --status (example: jr-8598aa9f459d****): Queries the status of the Spark job that has the specified job run ID.

      • --kill (example: jr-8598aa9f459d****): Terminates the Spark job that has the specified job run ID.
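
      For reference, several of these parameters can be combined in a single submission. The following command is only a sketch based on the SparkPi example above; the dependency JAR and configuration file passed with --jars and --files are placeholder names for illustration:

      # extra-dependency.jar and app.conf are placeholder names for illustration.
      ./bin/spark-submit --name SparkPi \
      --queue dev_queue \
      --num-executors 5 \
      --driver-memory 1g \
      --executor-cores 2 \
      --executor-memory 2g \
      --class org.apache.spark.examples.SparkPi \
      --jars oss://<yourBucket>/path/to/extra-dependency.jar \
      --files oss://<yourBucket>/path/to/app.conf \
      --conf spark.tags.key=value \
      oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
      10000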

    • Non-open source enhanced parameters

      • --detach (no value): Submits the Spark job and then immediately exits the spark-submit tool without waiting for the job status to be returned.

      • --detail (example: jr-8598aa9f459d****): Queries the details of the Spark job that has the specified job run ID.

      • --release-version (example: esr-4.1.1 (Spark 3.5.2, Scala 2.12, Java Runtime)): The Spark engine version to use. Configure this parameter based on the engine version displayed in the console.

      • --enable-template (no value): Applies the global configurations of the workspace to the job. If a parameter is also specified in --conf, the value specified in --conf prevails.

      • --timeout (example: 60): The timeout period of the job. Unit: seconds.
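
      For reference, the enhanced parameters can be added to a regular submission. The following command is only a sketch that submits the SparkPi example in detached mode with an explicit engine version and a timeout; the engine version string mirrors the example above, so replace it with the engine version displayed in the console:

      # Replace the --release-version value with the engine version shown in the console.
      ./bin/spark-submit --name SparkPi \
      --queue dev_queue \
      --class org.apache.spark.examples.SparkPi \
      --release-version "esr-4.1.1 (Spark 3.5.2, Scala 2.12, Java Runtime)" \
      --timeout 3600 \
      --detach \
      oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
      10000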

    • Unsupported open source parameters

      • --deploy-mode

      • --master

      • --repositories

      • --keytab

      • --principal

      • --total-executor-cores

      • --driver-library-path

      • --driver-class-path

      • --supervise

      • --verbose

    Use spark-sql

    spark-sql is a tool for running SQL queries or script files. It is suitable for scenarios in which SQL statements are executed directly.

    • Example 1: Directly execute an SQL statement

      spark-sql -e "SHOW TABLES"

      This command lists all tables in the current database.

    • Example 2: Run an SQL script file

      spark-sql -f oss://<yourBucketname>/path/to/your/example.sql

      In this example, the test file example.sql is used. Click example.sql to download the test file, and then upload it to OSS.

      Sample content in the example.sql file

      CREATE TABLE IF NOT EXISTS employees (
          id INT,
          name STRING,
          age INT,
          department STRING
      );
      
      INSERT INTO employees VALUES
      (1, 'Alice', 30, 'Engineering'),
      (2, 'Bob', 25, 'Marketing'),
      (3, 'Charlie', 35, 'Sales');
      
      SELECT * FROM employees;
      

    The following list describes the parameters.

    • -e "<sql>" (example: -e "SELECT * FROM table"): Executes the specified SQL statements directly in the CLI.

    • -f <path> (example: -f oss://path/script.sql): Executes the SQL script file in the specified path.

Step 4: Query the Spark job

Use the CLI

Query the status of the Spark job

cd emr-serverless-spark-tool-0.4.0
./bin/spark-submit --status <jr-8598aa9f459d****>

View the details of the Spark job

cd emr-serverless-spark-tool-0.4.0
./bin/spark-submit --detail <jr-8598aa9f459d****>

Use the UI

  1. In the left-side navigation pane of the EMR Serverless Spark page, click Job History.

  2. On the Job History page, click the Development Job Runs tab. Then, you can view all submitted jobs.

(Optional) Step 5: Terminate the Spark job

cd emr-serverless-spark-tool-0.4.0
./bin/spark-submit --kill <jr-8598aa9f459d****>
Note

You can terminate only a job that is in the Running state.