Bug message

py4j.protocol.Py4JJavaError: An error occurred while calling o29.csv.
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong
Root cause

After installing and configuring PySpark, reading data from an S3 server threw one error after another. After trying the fixes posted online, the conclusion is that the hadoop-aws and aws-java-sdk packages loaded through the environment variable had mismatched versions. After repeated testing, the following version pair is known to work:

hadoop-aws: 2.7.3
aws-java-sdk: 1.7.4
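A minimal sketch of how these two pinned versions are combined into the `--packages` value passed to spark-submit (each entry is a Maven coordinate of the form group:artifact:version, comma-separated); the version constants simply restate the pair above:

```python
# Pinned versions known to work together (from the testing above).
HADOOP_AWS_VERSION = "2.7.3"
AWS_SDK_VERSION = "1.7.4"

# Maven coordinates, comma-separated as spark-submit expects.
packages = ",".join([
    f"org.apache.hadoop:hadoop-aws:{HADOOP_AWS_VERSION}",
    f"com.amazonaws:aws-java-sdk:{AWS_SDK_VERSION}",
])
submit_args = f"--packages={packages} pyspark-shell"
print(submit_args)
```

The resulting string is exactly what PYSPARK_SUBMIT_ARGS is set to in the demo program below; keeping the versions in named constants makes it harder to update one jar without the other.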
Demo program

The connection only works over the s3a protocol; this program can connect to both China-region and global S3. Replace the access key ID and secret key with your own.
import os

# The extra jars must be declared before the SparkSession is created,
# so set PYSPARK_SUBMIT_ARGS first.
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3," \
                                    "com.amazonaws:aws-java-sdk:1.7.4 " \
                                    "pyspark-shell"

import pyspark

access_id = 'your_access_id'
access_key = 'your_access_key'

spark = pyspark.sql.SparkSession.builder \
    .master('local') \
    .appName("hxy_test_script") \
    .getOrCreate()
sc = spark.sparkContext

# Configure the S3A filesystem on the underlying Hadoop configuration.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
# Endpoint for the China (Beijing) region; change for other regions.
hadoop_conf.set("fs.s3a.endpoint", "s3.cn-north-1.amazonaws.com.cn")

path_list = ['s3a://bucket/DATA/00.csv']
df = spark.read.csv(path_list, header=True)

print(df.count())
df.show()  # show() prints the rows itself and returns None, so no print() around it
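Since the program is advertised as working for both China-region and global S3, the only line that has to change between the two is the fs.s3a.endpoint value. A small sketch of a region-to-endpoint lookup; the cn-north-1 entry is the one used in the demo above, while the other entries are assumptions following AWS's usual s3.&lt;region&gt;.amazonaws.com naming pattern:

```python
# Hypothetical helper: pick the fs.s3a.endpoint value by region.
# cn-north-1 matches the demo program; the other entries are
# assumed from AWS's standard regional endpoint naming.
S3_ENDPOINTS = {
    "cn-north-1": "s3.cn-north-1.amazonaws.com.cn",  # China (Beijing)
    "us-east-1": "s3.us-east-1.amazonaws.com",
    "eu-west-1": "s3.eu-west-1.amazonaws.com",
}

def s3a_endpoint(region: str) -> str:
    """Return the S3 endpoint for a region, or raise for unknown regions."""
    try:
        return S3_ENDPOINTS[region]
    except KeyError:
        raise ValueError(f"no endpoint configured for region {region!r}")

print(s3a_endpoint("cn-north-1"))
```

With this, the demo's endpoint line would become hadoop_conf.set("fs.s3a.endpoint", s3a_endpoint(region)), keeping the region choice in one place.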