Problems encountered when operating on ES data with Spark

Article link: https://blog.csdn.net/day_ue/article/details/120992103

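This article looks at the problems Spark runs into when the Elasticsearch field order_tp holds an empty string: the value shows up as null in Spark, and SQL filters on the field return nothing. It walks through several queries that fail to select the row, then covers reading array fields from ES with the es.read.field.as.array.include option. It closes by recommending that empty strings be stored uniformly as null to avoid these problems.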


Empty-string problem when Spark reads from ES

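Problem description:

order_tp is stored in ES as an empty string. Once it is read into Spark, it causes a range of unexpected problems.

Storage format in ES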

{
     "_index": "dwd_monitor_yuepengfei_test",
     "_type": "doc",
     "_id": "15",
     "_score": 1,
     "_source": {
       "chnl_cd": "10",
       "order_tp": ""
     }
}

How the data prints in Spark

+-------+--------+
|chnl_cd|order_tp|
+-------+--------+
|     10|    null|
+-------+--------+

The row above cannot be selected by any filter on the order_tp field

spark.sql("select * from test where order_tp = ''").show()
+-------+--------+
|chnl_cd|order_tp|
+-------+--------+
+-------+--------+
spark.sql("select * from test where order_tp <> ''").show()
+-------+--------+
|chnl_cd|order_tp|
+-------+--------+
+-------+--------+
spark.sql("select * from test where order_tp is null").show()
+-------+--------+
|chnl_cd|order_tp|
+-------+--------+
+-------+--------+
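
One possible explanation, which the original post does not confirm: elasticsearch-hadoop has a setting es.field.read.empty.as.null (default yes) that turns empty strings into null on read, and the Spark SQL connector pushes filters down to ES by default (the pushdown option), where a null check is evaluated against the stored document rather than the converted value. The sketch below is a way to test that hypothesis, not a definitive fix. The node address and the index/type path passed to load are assumptions, since the post does not show how the test view was created.

```scala
import org.apache.spark.sql.SparkSession

object EmptyStringDiagnostic {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("es-empty-string-diagnostic")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost") // assumption: ES is reachable here
      .option("es.port", "9200")       // assumption: default HTTP port
      // Keep "" as "" instead of converting it to null on read
      .option("es.field.read.empty.as.null", "no")
      // Evaluate filters in Spark only, not in Elasticsearch
      .option("pushdown", "false")
      .load("dwd_monitor_yuepengfei_test/doc") // assumption: index/type from the sample document

    df.createOrReplaceTempView("test")

    // Re-run the filters from the post and compare with the earlier results
    spark.sql("select * from test where order_tp = ''").show()
    spark.sql("select * from test where order_tp is null").show()

    spark.stop()
  }
}
```

Changing one option at a time makes it easier to see which of the two settings, if either, is responsible for the empty result sets.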

The order_tp field cannot be filtered in a query, yet it still groups together with a null value

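import org.apache.spark.sql.functions.lit

// Union a literal null row with the row read from ES, then group by order_tp
spark
  .sql("select 1 as chnl_cd, null as order_tp").withColumn("a", lit(-1))
  .union(spark.sql("select chnl_cd, order_tp from test").withColumn("a", lit(1)))
  .groupBy("order_tp")
  .agg(Map("a" -> "collect_list")).show()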

+--------+---------------+
|order_tp|collect_list(a)|
+--------+---------------+
|    null|        [-1, 1]|
+--------+---------------+

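As the results above show, it is unclear how Spark SQL actually treats an empty string stored in ES. Avoid empty strings where possible and store null consistently instead.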
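
The recommendation above can be sketched on the Spark side as a small cleanup step applied before data is written to ES. This is a minimal illustration under stated assumptions: emptyToNull is a hypothetical helper name, and the sample rows are made up to mirror the document shown earlier.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object EmptyToNull {
  // Hypothetical helper: replace empty strings in a string column with null.
  // Rows that are already null or non-empty are left unchanged.
  def emptyToNull(df: DataFrame, column: String): DataFrame =
    df.withColumn(
      column,
      when(col(column) === "", lit(null).cast("string")).otherwise(col(column))
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("empty-to-null")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: one row with an empty order_tp, one with a value
    val raw = Seq(("10", ""), ("11", "A")).toDF("chnl_cd", "order_tp")
    val cleaned = emptyToNull(raw, "order_tp")

    // The formerly empty row can now be selected with an ordinary null check
    cleaned.filter(col("order_tp").isNull).show()

    spark.stop()
  }
}
```

Doing the normalization in one helper keeps the rule in a single place, so every job that writes this field applies the same convention.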

Reading arrays from ES with Spark

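ES has no dedicated array field type: a String array is stored under a mapping of type keyword. When Spark reads the data, it sees the mapping as String, but the actual values are arrays, so the read fails with an error.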

Conventional solution

option("es.read.field.as.array.include", "<array field name>")
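
For context, here is a sketch of where that option sits in a complete read. The index path my_index/doc, the field name tags, and the node address are all hypothetical placeholders; only the option name comes from the post.

```scala
import org.apache.spark.sql.SparkSession

object ReadEsArrayField {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("es-read-array-field")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost") // assumption: ES is reachable here
      .option("es.port", "9200")       // assumption: default HTTP port
      // Tell the connector that "tags" (hypothetical field) holds an array,
      // even though its ES mapping is a plain keyword
      .option("es.read.field.as.array.include", "tags")
      .load("my_index/doc")            // hypothetical index/type

    // The field should now appear as an array column in the schema
    df.printSchema()
    df.show(truncate = false)

    spark.stop()
  }
}
```

When several fields hold arrays, the option value takes a comma-separated list of field names.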