Elasticsearch: Removing tokens by type

Elastic China Community Official Blog

Published 2023-04-16 19:55:19

License: CC 4.0 BY-SA

This is an original article by the author and may not be reproduced without the author's permission.

Article link: https://blog.csdn.net/UbuntuTouch/article/details/130186895

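In my earlier article "Elasticsearch: Examples of using token filters in analyzers", I gave many examples of using an analyzer's token filters to filter tokens. In this article, I show how to use another filter that keeps or removes tokens based on their type.

The keep types token filter keeps or removes tokens of the types you specify. Imagine an item description field, which usually receives text containing both words and numbers. Generating tokens for all of that text may not make sense. To avoid it, we will use the keep types token filter.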

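Removing numeric tokens

To remove numeric tokens, set the "types" parameter to "<NUM>". This parameter accepts a list of token types. Then set the "mode" parameter to "exclude".

Example: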

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ],
      "mode": "exclude"
    },
    {
      "type": "stop"
    }
  ],
  "text": "The German philosopher and economist Karl Marx was born on May 5, 1818."
}

The command above returns the following tokens:

{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "German",
      "start_offset": 4,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "philosopher",
      "start_offset": 11,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "economist",
      "start_offset": 27,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "Karl",
      "start_offset": 37,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "Marx",
      "start_offset": 42,
      "end_offset": 46,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "born",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "May",
      "start_offset": 59,
      "end_offset": 62,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}

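As the output shows, all numeric tokens have been removed. The stop filter also dropped the stop words "and", "was", and "on"; "The" survives because the stop filter is case-sensitive by default.

We can also keep only the numeric tokens with the following command: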

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ],
      "mode": "include"
    },
    {
      "type": "stop"
    }
  ],
  "text": "The German philosopher and economist Karl Marx was born on May 5, 1818."
}

The resulting tokens are:

{
  "tokens": [
    {
      "token": "5",
      "start_offset": 63,
      "end_offset": 64,
      "type": "<NUM>",
      "position": 11
    },
    {
      "token": "1818",
      "start_offset": 66,
      "end_offset": 70,
      "type": "<NUM>",
      "position": 12
    }
  ]
}
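
The difference between the two modes can be summarized with a small Python sketch. This is only an illustration of the include/exclude logic, not Elasticsearch's actual implementation: the `Token` tuple and the `keep_types` function are hypothetical names, and the token list is an abbreviated, hand-copied part of the sentence used above, with the types that `_analyze` reports for it.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    type: str  # token type as reported by _analyze, e.g. "<NUM>"


def keep_types(tokens, types, mode="include"):
    """Mimic the include/exclude behavior on (text, type) pairs (illustrative only)."""
    if mode not in ("include", "exclude"):
        raise ValueError(f"unknown mode: {mode!r}")
    wanted = set(types)
    if mode == "include":
        # Keep only tokens whose type is listed.
        return [t for t in tokens if t.type in wanted]
    # mode == "exclude": drop tokens whose type is listed.
    return [t for t in tokens if t.type not in wanted]


# Abbreviated token stream for "... born on May 5, 1818."
tokens = [
    Token("born", "<ALPHANUM>"),
    Token("on", "<ALPHANUM>"),
    Token("May", "<ALPHANUM>"),
    Token("5", "<NUM>"),
    Token("1818", "<NUM>"),
]

print([t.text for t in keep_types(tokens, ["<NUM>"], mode="exclude")])
# ['born', 'on', 'May']
print([t.text for t in keep_types(tokens, ["<NUM>"], mode="include")])
# ['5', '1818']
```

Note that the sketch leaves out the stop filter, which is why "on" still appears in the first result.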

Removing alphanumeric tokens

To remove the text tokens, we just set the "types" parameter to "<ALPHANUM>" and leave "mode" set to "exclude".

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<ALPHANUM>" ],
      "mode": "exclude"
    },
    {
      "type": "stop"
    }
  ],
  "text": "The German philosopher and economist Karl Marx was born on May 5, 1818."
}

Now only the numeric tokens remain:

{
  "tokens": [
    {
      "token": "5",
      "start_offset": 63,
      "end_offset": 64,
      "type": "<NUM>",
      "position": 11
    },
    {
      "token": "1818",
      "start_offset": 66,
      "end_offset": 70,
      "type": "<NUM>",
      "position": 12
    }
  ]
}
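
The three requests in this article differ only in the "types" and "mode" values, so the request body can be generated instead of retyped. The following Python sketch assumes a hypothetical helper named `build_analyze_body`. It only builds and prints the JSON body and does not send anything to a cluster. The default of "include" mirrors what I understand to be the filter's default mode.

```python
import json


def build_analyze_body(text, types, mode="include"):
    """Build an _analyze request body shaped like the ones in this article."""
    if mode not in ("include", "exclude"):
        raise ValueError(f"mode must be 'include' or 'exclude', got {mode!r}")
    return {
        "tokenizer": "standard",
        "filter": [
            # Same filter chain as the examples: keep_types first, then stop.
            {"type": "keep_types", "types": list(types), "mode": mode},
            {"type": "stop"},
        ],
        "text": text,
    }


body = build_analyze_body(
    "The German philosopher and economist Karl Marx was born on May 5, 1818.",
    ["<ALPHANUM>"],
    mode="exclude",
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after `GET _analyze` in the console, or passed to whichever client you use to call the analyze API.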