3 data ingestion tips that will change the way you search

Elastic China Community Official Blog

Published 2025-06-05 09:18:12


This is an original article by the blogger and may not be reproduced without the blogger's permission.

Article link: https://blog.csdn.net/UbuntuTouch/article/details/148439323


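Author: Alexander Dávila, from Elastic

Take your Elasticsearch data ingestion to the next level with these tips: data preprocessing, data enrichment, and choosing the right field types.

Learn about the different ways to ingest data into Elasticsearch, and try out some new approaches through practical examples.

Elasticsearch is packed with new features to help you build the best search solution for your use case. Start a free trial now.

Elasticsearch's flexibility makes it a great fit for building custom search solutions. That same flexibility also makes it easy to get things wrong if you are not careful. Whether you are just getting started or tuning an existing setup, applying a few smart strategies early can save you a lot of time and trouble.

In this article we cover 3 key tips: data preprocessing (also called "data cleaning"), data enrichment, and choosing the right field types. Together they help you improve the performance of your search system right away and avoid common pitfalls.

Practical use case setup

To see these tips in action, let's look at a common use case: a social media analytics platform. In this scenario, each document holds the data for a single post. The data is defined as follows.

Index mapping: this mapping contains all the field definitions needed to apply the tips in this article.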

PUT post-performance-metrics
{
  "mappings": {
    "properties": {
      "hashtags_string": {
        "type": "keyword"
      },
      "total_engagements": {
        "type": "integer"
      },
      "likes": {
        "type": "integer",
        "fields": {
          "ranking": {
            "type": "rank_feature"
          }
        }
      },
      "comments": {
        "type": "integer"
      },
      "shares": {
        "type": "integer"
      },
      "follower_count": {
        "type": "long"
      },
      "follower_tier": {
        "type": "keyword"
      },
      "user_id": {
        "type": "keyword"
      },
      "content": {
        "type": "text"
      }
    }
  }
}

Sample document: this is what a document looks like initially.

POST post-performance-metrics/_doc
{
    "user_id": "user123",
    "hashtags_string": "#elastic,#kibana,#ingest_tips",
    "likes": 12,
    "comments": 6,
    "shares": 2,
    "follower_count": 2560,
    "content": "Just learned more tips to improve my elastic game! #elastic #kibana #ingest_tips"
}

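Tips

Massage fields

"Massaging" fields means preprocessing the data so that it supports search better. Reasons to do this include:

Enhanced functionality (for example, searching arrays)
Better performance (for example, precomputed fields)

Convert a string list into a real array

Converting a string list into a real array lets us filter or aggregate documents more precisely. For example, suppose we have a document with the following keyword field: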

{
    "hashtags_string": "#elastic,#kibana,#ingest_tips"
}

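With a field like this, we cannot filter on each hashtag individually; a query has to match the entire string exactly. As a result, the following query does not return the document: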

GET post-performance-metrics/_search
{
    "query": {
        "term": {
            "hashtags_string": {
                "value": "#ingest_tips"
            }
        }
    }
}

On the other hand, if we split the hashtags so that the field looks like this:

{
    "hashtags_string": [
        "#elastic",
        "#kibana",
        "#ingest_tips"
    ]
}

the same query does return the document, because the array now contains the exact term #ingest_tips.
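To make the difference concrete, here is a minimal Python sketch of the matching rule. It models only the behavior described above (a keyword field is matched as one exact term per value), not how Elasticsearch works internally, and the helper name term_matches is hypothetical.

```python
def term_matches(field_value, term):
    """Mimic a term query on a keyword field: each value is one exact term."""
    values = field_value if isinstance(field_value, list) else [field_value]
    return term in values


as_string = "#elastic,#kibana,#ingest_tips"
as_array = as_string.split(",")  # split the comma-separated string into separate values

print(term_matches(as_string, "#ingest_tips"))  # the whole string is the only term
print(term_matches(as_array, "#ingest_tips"))   # each hashtag is its own term
```

The single string only matches when the query term equals the full comma-separated value, while the array matches on any one of its elements.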

We can define the split processor in an ingest pipeline like this:

PUT _ingest/pipeline/hashtag_splitter
{
    "description": "Splits hashtag string into array",
    "processors": [
        {
            "split": {
                "field": "hashtags_string",
                "separator": ",",
                "target_field": "hashtags_string"
            }
        }
    ]
}

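Note that the source field (the field parameter) and the destination field (the target_field parameter) can have different names. The separator can also be any character (or a pattern).

To use this pipeline, we can set it as the index's default pipeline. Every document indexed from then on goes through the pipeline, so the data is ready to use immediately.

Other ways to run the pipeline include an _update_by_query request, or a reindex operation that specifies the pipeline to use.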
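The three options above can be sketched as plain request descriptions. The Python snippet below only builds and prints the method, path, and body of each request so they can be pasted into a console; it does not talk to a cluster. The destination index name post-performance-metrics-v2 is a hypothetical example, not something from the original setup.

```python
import json

INDEX = "post-performance-metrics"
PIPELINE = "hashtag_splitter"

# 1. Default pipeline: every document indexed from now on goes through it.
set_default = ("PUT", f"/{INDEX}/_settings", {"index.default_pipeline": PIPELINE})

# 2. Re-process the documents already in the index, in place.
update_existing = ("POST", f"/{INDEX}/_update_by_query?pipeline={PIPELINE}", None)

# 3. Copy into another index, running the pipeline on the way.
#    "post-performance-metrics-v2" is a hypothetical destination index name.
reindex = ("POST", "/_reindex", {
    "source": {"index": INDEX},
    "dest": {"index": "post-performance-metrics-v2", "pipeline": PIPELINE},
})

for method, path, body in (set_default, update_existing, reindex):
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```

Which option fits depends on whether the documents are new (default pipeline), already indexed (update by query), or being moved to a new index anyway (reindex).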

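Precompute fields

When analyzing a dataset to compute metrics, some operations come up again and again. For example, to get the total number of engagements we need to add likes + comments + shares. We can compute this on every query, or just once when the document is indexed (precomputing). The second approach moves the work to index time, so each query has less to compute.

To do this, we can define a script processor that performs the calculation during ingestion and sets the total. The resulting document looks like this: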

{ 
 "total_engagements": 20,
 "likes": 12,
 "comments": 6,
 "shares": 2
}

Again, we can define an ingest pipeline for this:

PUT _ingest/pipeline/engagement_calculator
{
    "description": "Calculates total engagement",
    "processors": [
        {
            "script": {
                "source": """
                    int likes = ctx.likes != null ? ctx.likes : 0;
                    int shares = ctx.shares != null ? ctx.shares : 0;
                    int comments = ctx.comments != null ? ctx.comments : 0;
                    ctx.total_engagements = likes + shares + comments;
                """
            }
        }
    ]
}

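Note that the script treats any missing field as 0, so total_engagements is still set when a document lacks likes, shares, or comments.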
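As a quick sanity check of the arithmetic, the same logic can be mirrored in a few lines of Python. This is only an illustration of what the script computes; the function name total_engagements is chosen here to match the target field and is not part of any API.

```python
def total_engagements(doc):
    """Sum likes, shares and comments, treating a missing or null field as 0."""
    return sum((doc.get(name) or 0) for name in ("likes", "shares", "comments"))


post = {"likes": 12, "comments": 6, "shares": 2}
post["total_engagements"] = total_engagements(post)
print(post["total_engagements"])  # 20, as in the document above
```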

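Precompute ranges into a single field

To speed up searches, we can compute categorical fields from existing data. In this example we create a follower_tier field that classifies the author of a post by follower count.

Before computing the new field, let's look at the query. We want posts from medium-sized creators, defined here as having between 10001 and 100000 followers.

We could use a range query for this. But we would have to remember the definition every time we use the filter, and range queries are generally slower than exact-match queries.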

GET post-performance-metrics/_search
{
    "query": {
        "range": {
            "follower_count": {
                "gte": 10001,
                "lte": 100000
            }
        }
    }
}

Now let's define 3 follower tiers in a script processor:

PUT _ingest/pipeline/follower_tier_calculator
{
    "description": "Assigns influencer tier based on follower count",
    "processors": [
        {
            "script": {
                "source": """
                    // medium covers 10001 to 100000 followers, matching the range query above
                    if (ctx.follower_count < 10001) {
                        ctx.follower_tier = "small";
                    } else if (ctx.follower_count < 100001) {
                        ctx.follower_tier = "medium";
                    } else {
                        ctx.follower_tier = "large";
                    }
                """,
                "if": "ctx.follower_count != null"
            }
        }
    ]
}

Our documents now have a new keyword field called follower_tier that stores the precomputed category:

{
    "follower_count": 25600,
    "follower_tier": "medium"
}

This lets us filter for these creators with a term query, which is faster and easier to use:

GET post-performance-metrics/_search
{
    "query": {
        "term": {
            "follower_tier": {
                "value": "medium"
            }
        }
    }
}

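Term queries are usually faster than range queries because they are direct lookups. By converting the range into a single field, we get to take advantage of that speed.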
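The precomputed tier is only useful if it selects exactly the same documents as the range it replaces, so the boundaries deserve a check. The Python sketch below uses the tier definition from the text (medium is 10001 to 100000 followers) and a handful of made-up follower counts around the boundaries; follower_tier here is a plain function written for illustration.

```python
def follower_tier(follower_count):
    """Tier boundaries as defined in the text: medium is 10001 to 100000."""
    if follower_count is None:
        return None  # mirrors the pipeline's "if": documents without a count are skipped
    if follower_count <= 10000:
        return "small"
    if follower_count <= 100000:
        return "medium"
    return "large"


# Made-up follower counts chosen to sit on and around the tier boundaries.
counts = [2560, 10000, 10001, 25600, 100000, 100001]

by_range = [c for c in counts if 10001 <= c <= 100000]       # the range query
by_term = [c for c in counts if follower_tier(c) == "medium"]  # the term query

print(by_range == by_term)
```

If the tier definition ever changes, existing documents keep their old follower_tier value until they are re-processed, which is the trade-off for the cheaper query.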

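Enrich data

Data enrichment means extending the data in an index with external sources, adding context and depth to the indexed documents.

Enrich pipeline

An enrich pipeline uses data from another index to enrich documents in the current index at ingest time. This simplifies data management: the extra information is kept in one place (for example, one dedicated source can enrich several indices), which makes queries against the enriched data more consistent.

In our example, we enrich posts with creator demographics that live in a different index.

1. Create a user_demographics index containing data like this: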

POST /user_demographics/_doc
{
    "user_id": "user123",
    "age_group": "25-34",
    "interests": [
        "technology",
        "fashion",
        "travel"
    ],
    "account_creation_date": "2022-01-15",
    "user_segment": "tech_enthusiast"
}

2. Create and execute an enrich policy: the enrich policy defines how the lookup data relates to the documents we want to enrich.

PUT /_enrich/policy/user_demographics_policy
{
    "match": {
        "indices": "user_demographics",
        "match_field": "user_id",
        "enrich_fields": [
            "age_group",
            "interests",
            "account_creation_date",
            "user_segment"
        ]
    }
}

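In this example, we match the user_id of the posts index against the demographics index, and add all the other fields to incoming documents.

To execute the policy, we run: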

POST /_enrich/policy/user_demographics_policy/_execute

Now we create an ingest pipeline that uses the policy:

PUT /_ingest/pipeline/enrich_posts_with_user_data
{
    "description": "Enriches posts with user demographic data",
    "processors": [
        {
            "enrich": {
                "policy_name": "user_demographics_policy",
                "field": "user_id",
                "target_field": "user_demographics",
                "max_matches": 1
            }
        }
    ]
}

3. Test the new policy. Any new post document with a matching user_id is enriched with the demographic data.

Original document:

{
    "user_id": "user123"
}

Enriched result:

{
    "user_demographics": {
        "account_creation_date": "2022-01-15",
        "user_id": "user123",
        "age_group": "25-34",
        "user_segment": "tech_enthusiast",
        "interests": [
            "technology",
            "fashion",
            "travel"
        ]
    },
    "user_id": "user123"
}
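Conceptually, the enrich processor behaves like a keyed lookup followed by a merge under target_field. The Python sketch below imitates that behavior with an in-memory dictionary standing in for the enrich index; it is a simplified model for illustration, and the names ENRICH_INDEX and enrich are hypothetical.

```python
import copy

# Stand-in for the enrich index built from user_demographics, keyed by match_field.
ENRICH_INDEX = {
    "user123": {
        "user_id": "user123",
        "age_group": "25-34",
        "interests": ["technology", "fashion", "travel"],
        "account_creation_date": "2022-01-15",
        "user_segment": "tech_enthusiast",
    }
}


def enrich(doc, field="user_id", target_field="user_demographics"):
    """Attach the matching lookup record under target_field; leave the doc as-is if none matches."""
    enriched = copy.deepcopy(doc)
    match = ENRICH_INDEX.get(doc.get(field))
    if match is not None:
        enriched[target_field] = copy.deepcopy(match)
    return enriched


print(enrich({"user_id": "user123"}))
print(enrich({"user_id": "unknown_user"}))
```

Because the lookup data is a snapshot taken when the policy is executed, later changes to user_demographics only show up in new documents after the policy is executed again.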

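Inference pipeline

An inference pipeline lets us use machine learning (ML) models deployed on ML nodes in the Elasticsearch cluster to generate inferred data from a document. In this example we use lang_ident_model_1, a model that ships with Elasticsearch and identifies the language of a text.

1. Create an ingest pipeline with an inference processor that uses the model: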

PUT /_ingest/pipeline/detect_language
{
    "description": "Detects language of post content",
    "processors": [
        {
            "inference": {
                "model_id": "lang_ident_model_1",
                "target_field": "post_language",
                "field_map": {
                    "content": "text"
                }
            }
        }
    ]
}

Note how we define the target field (where the result is stored) and, in the field map, the source field (content).

2. Apply the pipeline to our data.

Original document:

{
  "content": "Just learned more tips to improve my elastic game! "
}

Here is the result with the inferred field:

{
    "content": "Just learned more tips to improve my elastic game! ",
    "post_language": {
        "prediction_score": 0.9993826906956544,
        "model_id": "lang_ident_model_1",
        "prediction_probability": 0.9993826906956544,
        "predicted_value": "en"
    }
}

Our result (en) is stored in post_language.predicted_value.

Let's try another example!

Original document:

{
    "content": "Kibana me permitió crear las visualizaciones en minutos"
}

Result with the inferred field:

{
    "content": "Kibana me permitió crear las visualizaciones en minutos",
    "post_language": {
        "prediction_score": 0.9823283879653826,
        "model_id": "lang_ident_model_1",
        "prediction_probability": 0.9823283879653826,
        "predicted_value": "es"
    }
}

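Here the model correctly classifies the text as Spanish!

Overall, this feature is worth exploring because it supports many interesting use cases, such as:

LLM document summarization using the completion task type
Sparse vector queries (through Elastic's ELSER inference endpoint)
Named entity recognition (NER)
Text classification (for example, for sentiment analysis)
Generating vector embeddings (for semantic search)

If you need a different model, take a look at Elastic's Eland, which works well with HuggingFace!

Bonus tip: remember that you can define multiple processors in the same ingest pipeline, and they do not have to be related. So all of the examples here can be combined into a single pipeline!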
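As a rough illustration of the bonus tip, the Python snippet below assembles one pipeline body from four of the processors shown in this article and prints it as a console request. The pipeline name post_ingest_all is hypothetical, and the script source is written on one line only to keep the Python short. Processors run in the order they are listed.

```python
import json

# "post_ingest_all" is a hypothetical pipeline name; the processors are the ones defined earlier.
combined_pipeline = {
    "description": "Splits hashtags, totals engagements, enriches and detects language",
    "processors": [
        {"split": {
            "field": "hashtags_string",
            "separator": ",",
            "target_field": "hashtags_string",
        }},
        {"script": {"source": (
            "int likes = ctx.likes != null ? ctx.likes : 0; "
            "int shares = ctx.shares != null ? ctx.shares : 0; "
            "int comments = ctx.comments != null ? ctx.comments : 0; "
            "ctx.total_engagements = likes + shares + comments;"
        )}},
        {"enrich": {
            "policy_name": "user_demographics_policy",
            "field": "user_id",
            "target_field": "user_demographics",
            "max_matches": 1,
        }},
        {"inference": {
            "model_id": "lang_ident_model_1",
            "target_field": "post_language",
            "field_map": {"content": "text"},
        }},
    ],
}

print("PUT _ingest/pipeline/post_ingest_all")
print(json.dumps(combined_pipeline, indent=2))
```

Keeping unrelated steps in one pipeline means a single default pipeline on the index is enough to apply all of them.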

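Choose the right field types

Choosing the right field type is often overlooked, but it has a big impact on performance and functionality. Elasticsearch offers more than 40 field types, each with its own strengths. Some are mainly performance optimizations, such as picking a numeric type based on the range of values it has to hold; others come with extra built-in functionality. Did you know that:

The ip field type lets you search by IP ranges, including with masks?
search_as_you_type makes it easy to implement search suggestions as the user types?
percolator can help you build an alerting system?
semantic_text gives you semantic search out of the box?
rank_feature and rank_features store numbers that can be used as relevance signals?

And there is more! See the official documentation for details.

rank_feature example

In this article, we use a post's number of likes to boost its ranking in search results. To do that, we define a field of type rank_feature and run a rank_feature query. This adds a very useful capability to the query, with far less performance impact than a function_score query.

1. Define the right mapping: since we may want to use the likes count in other queries, we define rank_feature as a multi-field named "ranking".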

PUT post-performance-metrics
{
    "mappings": {
        "properties": {
            "content": {
                "type": "text"
            },
            "likes": {
                "type": "integer",
                "fields": {
                    "ranking": {
                        "type": "rank_feature"
                    }
                }
            }
        }
    }
}

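Only the relevant part of the mapping is shown here.

2. Index documents that contain the "likes" field: remember that the mapping definition takes care of populating all multi-fields, so we only need to provide the main likes field.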

POST post-performance-metrics/_doc
{
 "content": "Just upgraded our Elasticsearch deployment and the speed difference is noticeable.",
 "likes": 78
}

POST post-performance-metrics/_doc
{
 "content": "How we cut our search response time in half with Elasticsearch optimization",
 "likes": 1200
}

3. Run a rank_feature query: it usually goes in a should clause, so that it only affects scoring and not which documents are returned.

GET post-performance-metrics/_search
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "content": "elasticsearch speed"
                }
            },
            "should": [
                {
                    "rank_feature": {
                        "field": "likes.ranking",
                        "boost": 1.5
                    }
                }
            ]
        }
    }
}

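The expectation is that posts with more likes get extra score, so a more popular post can be shown first even when the query text matches another post better. Note that we can control how much the like count influences the score: a boost below 1.0 reduces the feature's contribution relative to the default of 1.0. The response looks like this: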

{
    "hits": {
        "hits": [
            {
                "_score": 1.3652647,
                "_source": {
                    "content": "How we cut our search response time in half with Elasticsearch optimization",
                    "likes": 1200
                }
            },
            {
                "_score": 1.2482762,
                "_source": {
                    "content": "Just upgraded our Elasticsearch deployment and the speed difference is noticeable.",
                    "likes": 78
                }
            }
        ]
    }
}

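Relevance is subjective! It depends on who your users are, what they search for and how, what your data is, what the business needs are, and so on. Learn more about relevance.

The idea is to build more complex queries that shape the overall ranking, with the rank_feature query acting on top of that ranking to produce better results.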
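To get a feel for how likes and boost interact, here is a small Python sketch of the saturation function that the rank_feature query uses by default, S / (S + pivot). The pivot of 300 is an assumed value chosen purely for illustration (by default Elasticsearch derives it from the indexed feature values), so the numbers do not reproduce the scores in the response above. The function names are hypothetical.

```python
def saturation(value, pivot):
    """Saturation function: grows with value but always stays below 1."""
    return value / (value + pivot)


def rank_feature_bonus(likes, boost=1.0, pivot=300.0):
    # pivot=300.0 is an assumed value for illustration only.
    return boost * saturation(likes, pivot)


for likes in (78, 1200):
    high = rank_feature_bonus(likes, boost=1.5)
    low = rank_feature_bonus(likes, boost=0.5)
    print(likes, round(high, 3), round(low, 3))
```

Under these assumptions the more-liked post always gets the larger bonus, and lowering the boost shrinks the bonus without ever making it negative.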