Improving e-commerce search with query profiles in Elastic

Elastic China Community Official Blog

Published 2025-01-10 09:40:27

License: CC 4.0 BY-SA

This is an original article by the blogger and may not be reproduced without the blogger's permission.

Article link: https://blog.csdn.net/UbuntuTouch/article/details/145047628

Author: Han Xiang Choong, Elastic

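Query profiles address the challenges of semantic search in e-commerce. This post shows how to use query profiles in Elastic to improve e-commerce search.

Implementing semantic search for e-commerce use cases can be tricky because product formats are far from standardized. In this post we tackle that challenge with query profiles in Elastic: an approach that takes multiple metadata fields and turns them into a piece of text resembling a user's preferences and requests. Through a practical example, we show how query profiles improve e-commerce search.

Introduction

Elasticsearch is a natural fit for e-commerce data, by which I mean large collections of product definitions like this Amazon product dataset. Let's download the sample file containing 10,000 products and upload the CSV to an Elastic index named amazon_product_10k (I'm using my Elastic Cloud deployment).

Looking at the data, we see product entries like this one, for a superhero-themed bobblehead of Black Lightning: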

{
  "_index": "amazon_product_10k_plain_embed",
  "_id": "F-Qi2JIBnZufN_5vn-sr",
  "_version": 1,
  "_score": 0,
  "_ignored": [
    "Image.keyword",
    "product_specification.keyword",
    "technical_details.keyword"
  ],
  "_source": {
    "selling_price": "$17.75",
    "Category": "Toys & Games | Collectible Toys | Statues, Bobbleheads & Busts | Statues",
    "shipping_weight": "3.7 pounds",
    "product_specification": "ProductDimensions:3x3x12.4inches|ItemWeight:2pounds|ShippingWeight:3.7pounds(Viewshippingratesandpolicies)|DomesticShipping:ItemcanbeshippedwithinU.S.|InternationalShipping:ThisitemcanbeshippedtoselectcountriesoutsideoftheU.S.LearnMore|ASIN:B077SCH3B2|Itemmodelnumber:DEC170420|Manufacturerrecommendedage:15yearsandup",
    "is_amazon_seller": "Y",
    "id": "b4358a38037a7e7fbcd7fe16970e7bff",
    "model_number": "DEC170420",
    "Image": "https://images-na.ssl-images-amazon.com/images/I/41vJ5amvKeL.jpg|https://images-na.ssl-images-amazon.com/images/I/31tr4qwqZmL.jpg|https://images-na.ssl-images-amazon.com/images/I/31-%2BMGqJASL.jpg|https://images-na.ssl-images-amazon.com/images/I/31sgD4%2B0HlL.jpg|https://images-na.ssl-images-amazon.com/images/I/51HbaURPW8L.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel.jpg",
    "product_name": "DC Collectibles DCTV: Black Lightning Resin Statue",
    "about_product": "Make sure this fits by entering your model number. | From the upcoming DCTV series on The CW | Limited Edition of 5,000 | Measures approximately 12.42\" tall | Sculpted by Alterton",

    "url": "https://www.amazon.com/DC-Collectibles-DCTV-Lightning-Statue/dp/B077SCH3B2",
    "technical_details": "show up to 2 reviews by default Jefferson Pierce returns to the superhero fold as Black Lightning, the star of The CW's upcoming TV series Black Lightning! Limited edition of 5,000. Measures approximately 12.42\" tall. Sculpted by Alterton. | 3.7 pounds (View shipping rates and policies)"
  }
}

The search use case here is a user looking for products with a request like:

Superhero bobbleheads

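For fans of semantic search, there is an immediate problem. The main source of searchable text is the product description, which looks like this: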

 Make sure this fits by entering your model number. 
 | From the upcoming DCTV series on The CW 
 | Limited Edition of 5,000 
 | Measures approximately 12.42\" tall 
 | Sculpted by Alterton

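This tells us nothing about what the product actually is. A naive approach would be to pick an embedding model, embed the description, and run semantic search over it. Whichever embedding model we choose, this particular product will not surface for a query like "Superhero bobbleheads". Let's try it and see what happens.

Naive semantic search

Go ahead and deploy elser_v2 with the following command (make sure ML node autoscaling is enabled):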

PUT _inference/sparse_embedding/elser_v2
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 4,
    "num_threads": 8
  }
}

Let's define a new index named amazon_product_10k_plain_embed, map the product description as a semantic_text field backed by our elser_v2 inference endpoint, and run a reindex:

PUT amazon_product_10k_plain_embed
{
  "mappings": {
    "properties": {
      "about_product": {
        "type": "semantic_text",
        "inference_id": "elser_v2"
      }

    }
  }
}

POST _reindex?slices=auto&wait_for_completion=false
{
  "conflicts": "proceed", 
  "source": {
    "index": "amazon_product_10k",
    "size": 64
  },
  "dest": {
    "index": "amazon_product_10k_plain_embed"
  }
}

Then run a semantic search:

GET amazon_product_10k_plain_embed/_search
{
  "_source": ["about_product.text", "technical_details", "product_name"], 
  "retriever": {
    "standard": {
      "query": {
        "nested": {
          "path": "about_product.inference.chunks",
          "query": {
            "sparse_vector": {
              "inference_id": "elser_v2",
              "field": "about_product.inference.chunks.embeddings",
              "query": "superhero bobblehead"
            }
          }
        }
      }
    }
  },
  "size": 20
}
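The same nested sparse_vector shape recurs for every semantic_text field queried in this post, so it can help to generate the request body programmatically. The following is a minimal Python sketch under that assumption; the helper names semantic_text_query and search_body are hypothetical, and the field paths simply mirror the request above.

```python
import json


def semantic_text_query(field: str, query_text: str, inference_id: str = "elser_v2") -> dict:
    """Build the nested sparse_vector query used above for one semantic_text field."""
    chunks_path = f"{field}.inference.chunks"
    return {
        "nested": {
            "path": chunks_path,
            "query": {
                "sparse_vector": {
                    "inference_id": inference_id,
                    "field": f"{chunks_path}.embeddings",
                    "query": query_text,
                }
            },
        }
    }


def search_body(field: str, query_text: str, size: int = 20) -> dict:
    """Wrap the query in a standard retriever, mirroring the request above."""
    return {
        "retriever": {"standard": {"query": semantic_text_query(field, query_text)}},
        "size": size,
    }


if __name__ == "__main__":
    # Prints a JSON body that can be pasted under a GET <index>/_search line.
    print(json.dumps(search_body("about_product", "superhero bobblehead"), indent=2))
```

Swapping the field argument is all that changes when the same search is pointed at a different semantic_text field.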

Voilà. The product_name values we get back are pretty bad.

1.  Idea Max Peek-A-Pet Bobble Heads Flowers Corgi (Tea Cup)
2. Mezco Toyz Sons Of Anarchy 6" Clay Bobblehead
3. Funko Marvel Captain America Pop Vinyl Figure

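The Corgi is a bobblehead, but it is not a superhero. The Sons of Anarchy are not superheroes either; they are motorcycle enthusiasts. And a Funko Pop is definitely not a bobblehead. So why are the results so poor?

The information we need actually lives in the Category and technical_details fields: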

"Category": "Toys & Games | Collectible Toys | Statues, Bobbleheads & Busts | Statues",
"technical_details": "show up to 2 reviews by default Jefferson 
Pierce returns to the superhero fold as Black Lightning, the 
star of The CW's upcoming TV series Black Lightning! 
Limited edition of 5,000. Measures approximately 12.42\" 
tall. Sculpted by Alterton. | 3.7 pounds 
(View shipping rates and policies)"

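The second problem: only Category tells us the product is a bobblehead, and only technical_details tells us it has anything to do with superheroes. So the next naive step is to embed all three fields, run vector search across all three, and hope the combined score puts this product near the top of the results.

Apart from the obvious threefold increase in compute and storage cost, we are also taking it on faith that the three resulting embeddings will not be noisy, even though the product description is largely irrelevant and the category and technical details each contain only one word relevant to the search query.

Naive semantic search, beefed up

Let's try it anyway and see what happens. We embed all three fields: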

PUT amazon_product_10k_triple_embed_3
{
  "mappings": {
    "properties": {
      "about_product": {
        "type": "semantic_text",
        "inference_id": "elser_v2"
      },
      "technical_details": {
        "type": "semantic_text",
        "inference_id": "elser_v2"
      },
      "Category": {
        "type": "semantic_text",
        "inference_id": "elser_v2"
      }
      
    }
  }
}

POST _reindex?slices=auto&wait_for_completion=false
{
  "conflicts": "proceed", 
  "source": {
    "index": "amazon_product_10k",
    "size": 64
  },
  "dest": {
    "index": "amazon_product_10k_triple_embed_3"
  }
}

And run another search, using a retriever with Elastic's built-in reciprocal rank fusion (RRF).

GET amazon_product_10k_triple_embed_3/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "nested": {
                "path": "about_product.inference.chunks",
                "query": {
                  "sparse_vector": {
                    "inference_id": "elser_v2",
                    "field": "about_product.inference.chunks.embeddings",
                    "query": "superhero bobblehead"
                  }
                },
                "inner_hits": {
                  "size": 2,
                  "name": "amazon_product_10k_triple_embed_3.about_product",
                  "_source": [
                    "about_product.inference.chunks.text"
                  ]
                }
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "nested": {
                "path": "Category.inference.chunks",
                "query": {
                  "sparse_vector": {
                    "inference_id": "elser_v2",
                    "field": "Category.inference.chunks.embeddings",
                    "query": "superhero bobblehead"
                  }
                },
                "inner_hits": {
                  "size": 2,
                  "name": "amazon_product_10k_triple_embed_3.Category",
                  "_source": [
                    "Category.inference.chunks.text"
                  ]
                }
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "nested": {
                "path": "technical_details.inference.chunks",
                "query": {
                  "sparse_vector": {
                    "inference_id": "elser_v2",
                    "field": "technical_details.inference.chunks.embeddings",
                    "query": "superhero bobblehead"
                  }
                },
                "inner_hits": {
                  "size": 2,
                  "name": "amazon_product_10k_triple_embed_3.technical_details",
                  "_source": [
                    "technical_details.inference.chunks.text"
                  ]
                }
              }
            }
          }
        }
      ]
    }
  }
}
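For intuition about how RRF merges the three per-field rankings: each document receives 1 / (rank_constant + rank) from every list it appears in, and the contributions are summed. The Python sketch below illustrates that formula on made-up document IDs; it assumes a rank_constant of 60 and is not a reproduction of Elasticsearch's internal implementation.

```python
from collections import defaultdict


def rrf_scores(rankings, rank_constant=60):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (rank_constant + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy rankings standing in for the about_product, Category and
    # technical_details retrievers; the IDs are invented for illustration.
    per_field = [
        ["a", "b", "c"],
        ["b", "a", "d"],
        ["d", "b", "a"],
    ]
    for doc_id, score in rrf_scores(per_field):
        print(doc_id, round(score, 5))
```

Because only ranks matter, a document that sits near the top of one noisy field's list is rewarded just as much as one that sits near the top of an informative field's list, which is one plausible reason fusing three weak signals does not help here.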

But the results are actually worse than before:

1. Sunny Days Entertainment Bendems Collectible Posable Figures - Bobs Burgers: Bob
2. Star Wars Childs Boba Fett Costume, Medium
3. DIAMOND SELECT TOYS Batman The Animated Series: Ra's Al Ghul Resin Bust Figure

One of my colleagues (@Jeff Vestal) may be delighted that Bob from Bob's Burgers is the top result for superhero merchandise.

So, what now?

HyDE

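Here's an idea. Let's use an LLM to improve the quality of our data and make vector search more effective. There is a technique called HyDE, and the proposal is quite intuitive. The query contains the key content, namely the keywords "Superhero" and "Bobblehead". What it does not capture is the form and structure of the documents we are actually searching for. In other words, a search query differs in form from the indexed documents, even though the two may share content. So with both keyword and semantic search, we match content to content but not form to form.

HyDE uses an LLM to turn the query into hypothetical documents, which capture relevance patterns but do not contain the real content that would answer the query. The hypothetical document is then embedded and used for vector search. In short, we match form to form as well as content to content.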
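The HyDE flow described above can be sketched in a few lines of Python. This is an illustration only: the generate, embed, and search callables are hypothetical stand-ins for an LLM, an embedding model, and a vector index, the prompt wording is invented, and the dense-vector type is an assumption made for simplicity.

```python
from typing import Callable, List, Sequence, Tuple


def hyde_search(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[Tuple[str, float]]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Search with the embedding of a hypothetical document instead of the raw query."""
    # 1. Ask the LLM for a document shaped like the ones in the index.
    prompt = f"Write a short passage that would answer this request: {query}"
    hypothetical_doc = generate(prompt)
    # 2. Embed the hypothetical document, not the query itself.
    vector = embed(hypothetical_doc)
    # 3. Run the vector search with that embedding.
    return search(vector, top_k)


if __name__ == "__main__":
    # Stub components so the sketch runs without any external service.
    def fake_generate(prompt: str) -> str:
        return "A collectible bobblehead figure of a comic-book superhero."

    def fake_embed(text: str) -> Sequence[float]:
        return [float(len(text))]

    def fake_search(vector: Sequence[float], k: int) -> List[Tuple[str, float]]:
        return [("doc-1", vector[0])][:k]

    print(hyde_search("superhero bobblehead", fake_generate, fake_embed, fake_search))
```

Note that the LLM call happens at query time here, once per search.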

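Let's modify the idea slightly for e-commerce.

Query profiles

What I call a query profile takes multiple metadata fields and turns them into a piece of text resembling a user's preferences and likely requests. That query profile is then embedded, and subsequent vector searches run against it. The LLM is instructed to create a document that mimics what a user might ask for when searching for the product. The flow is: product metadata fields go into an LLM prompt, the LLM writes the query profile, the profile is embedded, and searches run against that embedding.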

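I see two main advantages to this approach:

1. It consolidates information from multiple fields into a single document.
2. It captures the likely forms a user request can take, covering as many bases as possible while doing so.

The resulting text is information-rich and may give us better results at search time. Let's implement it and see what happens.

Implementing query profiles

We will define a pipeline in Elasticsearch with an LLM processor. I'm using GPT-4o mini from my company's Azure OpenAI deployment, so let's define the inference endpoint like this: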

PUT _inference/completion/azure_openai_gpt4omini_completion
{
    "service": "azureopenai",
    "service_settings": {
        "api_key": "<YOUR API KEY>",
        "resource_name": "<YOUR RESOURCE NAME>",
        "deployment_id": "gpt-4o-mini",
        "api_version": "2024-06-01"
    }
}

Now let's define an ingest pipeline that contains the query profile prompt.

PUT _ingest/pipeline/amazon_10k_query_profile_pipeline
{
  "processors": [
    {
      "script": {
        "source": """
    ctx.query_profile_prompt = 'Given a {product}, create a detailed query 
    profile written from a customers perspective describing what they are 
    looking for when shopping for this exact item. Include key characteristics 
    like type, features, use cases, quality aspects, materials, and target user. 
    Focus on aspects a shopper would naturally mention in their search query. 
    
    Format: descriptive text without bullet points or sections. 
    
    Example: "Looking for a high-end lightweight carbon fiber road bike for 
    competitive racing with electronic gear shifting and aerodynamic frame 
    design suitable for experienced cyclists who value performance and speed."
    
    Describe this product in natural language that matches how real customers 
    would search for it. 
    
    Here are the product details: 
    \\n Product Name:\\n' + ctx.product_name 
    + '\\nAbout Product:\\n' + ctx.about_product 
    + '\\nCategory:\\n' + ctx.Category
    + '\\nTechnical Details:\\n' + ctx.technical_details
    """
      }
    },
    {
      "inference": {
        "model_id": "azure_openai_gpt4omini_completion",
        "input_output": {
          "input_field": "query_profile_prompt",
          "output_field": "query_profile"
        },
        "on_failure": [
          {
            "set": {
              "description": "Index document to 'failed-<index>'",
              "field": "_index",
              "value": "failed-{{{ _index }}}"
            }
          }
        ]
      }
    },
    {
      "remove": {
        "field": "query_profile_prompt"
      }
    }
  ]
}
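Before spending LLM calls on 10,000 documents, it can be useful to prototype the prompt assembly outside Elasticsearch. The Python sketch below mirrors what the script processor does, assuming the field names from the sample document shown earlier (note that the category field is stored as `Category`, with a capital C). The function name build_query_profile_prompt is hypothetical, and skipping empty fields is a design choice of this sketch rather than something the pipeline above does.

```python
# (label shown to the LLM, field name in the source document)
FIELDS = [
    ("Product Name", "product_name"),
    ("About Product", "about_product"),
    ("Category", "Category"),
    ("Technical Details", "technical_details"),
]


def build_query_profile_prompt(instructions: str, doc: dict) -> str:
    """Append the available product fields to the instruction text."""
    parts = [instructions.strip(), "Here are the product details:"]
    for label, key in FIELDS:
        value = doc.get(key)
        if value is None or not str(value).strip():
            # Skip missing fields so the LLM never sees a literal "null"/"None".
            continue
        parts.append(f"{label}:\n{str(value).strip()}")
    return "\n".join(parts)


if __name__ == "__main__":
    sample = {
        "product_name": "DC Collectibles DCTV: Black Lightning Resin Statue",
        "Category": "Toys & Games | Collectible Toys | Statues, Bobbleheads & Busts | Statues",
    }
    print(build_query_profile_prompt("Given a product, create a detailed query profile.", sample))
```

Printing a handful of assembled prompts this way makes it easy to spot empty or misnamed fields before running the full reindex.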

We'll run a reindex to create the new field using our LLM integration:

POST _reindex?slices=auto&wait_for_completion=false
{
  "conflicts": "proceed", 
  "source": {
    "index": "amazon_product_10k",
    "size": 32
  },
  "dest": {
    "index": "amazon_product_10k_w_query_profiles",
    "pipeline": "amazon_10k_query_profile_pipeline",
    "op_type": "create"
  }
}

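Once that completes, we'll define another index, map query_profile as a semantic_text field, and run the embedding with ELSER. I like to split the processing into two stages so that the fruits of the LLM's labor are preserved independently of the embeddings. Call it insurance against an act of God.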

PUT amazon_product_10k_qp_embed
{
  "mappings": {
    "properties": {
      "query_profile": {
        "type": "semantic_text",
        "inference_id": "elser_v2"
      }
    }
  }
}
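The walkthrough does not show how documents get from amazon_product_10k_w_query_profiles into this semantic_text index. Presumably it is one more _reindex, like the earlier ones but without the LLM pipeline. A minimal Python sketch of that request body follows; the helper name reindex_body is hypothetical, the batch size is an assumption, and the destination name in the demo is taken from the search request below.

```python
import json


def reindex_body(source_index: str, dest_index: str, batch_size: int = 32) -> dict:
    """Build a _reindex body copying documents into the semantic_text index."""
    return {
        "conflicts": "proceed",
        "source": {"index": source_index, "size": batch_size},
        # No pipeline here: query_profile already exists on the source documents,
        # and the semantic_text mapping on the destination triggers the embedding.
        "dest": {"index": dest_index},
    }


if __name__ == "__main__":
    body = reindex_body("amazon_product_10k_w_query_profiles", "amazon_product_10k_qp_embed")
    # Paste the output under: POST _reindex?slices=auto&wait_for_completion=false
    print(json.dumps(body, indent=2))
```

Pass whichever index name the semantic_text mapping was actually created under as dest_index.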

Now let's run the same query again, this time as a semantic search over the query profiles, and see what we get:

GET amazon_product_10k_qp_embed/_search
{
  "_source": ["about_product.text", "technical_details.text", "product_name"], 
  "retriever": {
    "standard": {
      "query": {
        "nested": {
          "path": "query_profile.inference.chunks",
          "query": {
            "sparse_vector": {
              "inference_id": "elser_v2",
              "field": "query_profile.inference.chunks.embeddings",
              "query": "superhero bobblehead"
            }
          }
        }
      }
    }
  },
  "size": 20
}

And indeed, the results are much better.

1. FOCO DC Comics Justice League Character Bobble, Superman
2. The Tin Box Company Batman Bobble Head Bank, Black
3. Potato Head MPH Marvel Mashup Hawkeye & Iron Man Toy

OK, result 3 is a Potato Head. Fair enough. But results 1 and 2 are genuine superhero bobbleheads, so I'll count this as a win.