Elasticsearch: Script aggregation (1)

Originally published 2020-09-23 15:14:10

License: CC 4.0 BY-SA. This is an original article by the author and may not be reproduced without the author's permission.


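Developers often cannot get the result they expect from the default aggregations, and the basic aggregation features have their limits. One example is a histogram whose values need to be shifted before they are assigned to buckets. In cases like this we can use a script to get the result we want. This article also covers other aggregation tasks that rely on scripts. My earlier article "开始使用Elasticsearch （3）" (Getting started with Elasticsearch (3)) touches on some of this; today's article explores it further.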

This series consists of two articles; this is the first.

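Preparing the data

To support the examples below, we use a small set of documents describing the employees of a fictional company. Each document holds an employee's name, age, position and salary, plus a few other details. We create the employee index by indexing three documents: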

PUT /employee/_doc/1
{
  "name": "Bob",
  "age": 35,
  "about": "Bob joined the company as a full time technology consultant in the year 2012",
  "position": "consultant",
  "salary": 5000,
  "experience": "3-years",
  "married": 1,
  "fullTime": true
}

PUT /employee/_doc/2
{
  "name": "Jack",
  "age": 30,
  "about": "Jack joined the company as a part time management consultant in the year 2013",
  "position": "Management consultant",
  "salary": 3000,
  "experience": "3-years",
  "married": 0,
  "fullTime": false
}

PUT /employee/_doc/3
{
  "name": "Tom",
  "age": 33,
  "about": "Tom is serving as the operations manager of the firm from the year 2011",
  "position": "Operations manager",
  "salary": 7000,
  "experience": "7-years",
  "married": 1,
  "fullTime": true
}

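Run the three commands above in Dev Tools in Kibana. This creates an index named employee. For brevity, this tutorial indexes only three documents; you can of course change the values in these examples and index more documents.

Using a script to change the default histogram values

Suppose our supervisor needs a histogram that shows how many employees fall into each salary interval (bin), with intervals of 3,000 dollars. We have the interval and the data, so we can run a histogram aggregation. There is one catch: with an interval of 3,000, the default bucket boundaries fall at 0, 3,000, 6,000, 9,000 and so on. A plain histogram aggregation looks like this: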

GET employee/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "histogram": {
        "field": "salary",
        "interval": 3000
      }
    }
  }
}

The aggregation above returns:

  "aggregations" : {
    "NAME" : {
      "buckets" : [
        {
          "key" : 3000.0,
          "doc_count" : 2
        },
        {
          "key" : 6000.0,
          "doc_count" : 1
        }
      ]
    }
  }

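This shows two documents in the bucket from 3,000 up to (but not including) 6,000, and one document in the bucket that starts at 6,000.

Our supervisor then clarifies that the data should be broken down so that we can see whose salary falls in the 0-3,000 range, then 3,000-6,000, and so on. This calls for shifting the values before the histogram assigns them to buckets. The histogram aggregation does have an offset parameter for moving the bucket boundaries, but the point of this article is that a script can do the job as well.

Here is a query that helps: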
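To make the bucket arithmetic concrete, here is a minimal Python sketch of how a histogram assigns a value to a bucket. It assumes the rounding rule given in the Elasticsearch documentation, key = floor((value - offset) / interval) * interval + offset. The function names bucket_key and histogram are made up for this illustration, and unlike Elasticsearch the sketch does not emit empty buckets.

```python
import math
from collections import Counter

def bucket_key(value, interval, offset=0):
    """Lower bound of the bucket that `value` falls into (hypothetical helper)."""
    return math.floor((value - offset) / interval) * interval + offset

def histogram(values, interval, offset=0):
    """Count values per bucket key, sorted by key; empty buckets are not emitted."""
    counts = Counter(bucket_key(v, interval, offset) for v in values)
    return dict(sorted(counts.items()))

# Salaries of the three documents indexed above.
salaries = {"Bob": 5000, "Jack": 3000, "Tom": 7000}

default_buckets = histogram(salaries.values(), 3000)
print(default_buckets)  # {3000: 2, 6000: 1}
```

Jack's salary of exactly 3,000 sits on a boundary, and because each bucket includes its lower bound it is counted in the bucket with key 3,000 together with Bob's 5,000.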

GET employee/_search
{
  "size": 0, 
  "aggs": {
    "histogramData": {
      "histogram": {
        "field": "salary",
        "interval": 3000,
        "script": "_value + 2000"
      }
    }
  }
}

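The script adds 2,000 to every salary before the value is assigned to a bucket; the interval stays at 3,000. The salaries 3,000, 5,000 and 7,000 therefore become 5,000, 7,000 and 9,000, which fall into the buckets with keys 3,000, 6,000 and 9,000. The highest shifted value is 9,000, so the last bucket is the one that starts at 9,000. Note that the keys refer to the shifted values: in terms of actual salaries, the three buckets cover 1,000-4,000, 4,000-7,000 and 7,000-10,000. Each employee now lands in a bucket of their own.

The result is: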

  "aggregations" : {
    "histogramData" : {
      "buckets" : [
        {
          "key" : 3000.0,
          "doc_count" : 1
        },
        {
          "key" : 6000.0,
          "doc_count" : 1
        },
        {
          "key" : 9000.0,
          "doc_count" : 1
        }
      ]
    }
  }

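For more on how to use offset, see my earlier article "Kibana：运用 aggregation 的高级设置来微调统计结果" (Kibana: fine-tuning statistics with the advanced settings of aggregations).

Using a script to split the value of a field

Next, we aggregate on only part of the data in a field. The documents in the index contain an experience field whose values have the form "x-years" (where "x" is a number).

A regular terms aggregation on this field returns buckets named "3-years" and "7-years". Suppose instead that we want the bucket names to be "3" and "7". In general this means splitting the value at the "-" character and keeping only the first part; because the number of years in our data is a single digit, the script below simply takes the first character. By default a terms aggregation cannot run on an analyzed text field, so the field has to be of type keyword. With dynamic mapping, experience was indexed as text (with an experience.keyword sub-field), so we define a new index that maps it as keyword: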
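The following Python sketch compares the value script with the offset parameter on the three salaries. It is an illustration under assumptions, not Elasticsearch code: bucket_key is a hypothetical helper that applies the documented rounding rule floor((value - offset) / interval) * interval + offset, and the offset of 1,000 is derived here as (-2000) mod 3000.

```python
import math

def bucket_key(value, interval, offset=0):
    """Lower bound of the bucket that `value` falls into (hypothetical helper)."""
    return math.floor((value - offset) / interval) * interval + offset

salaries = {"Bob": 5000, "Jack": 3000, "Tom": 7000}
INTERVAL = 3000
SHIFT = 2000  # the constant added by the script "_value + 2000"

# Value script: shift every salary first, then bucket with the default offset 0.
script_keys = {name: bucket_key(s + SHIFT, INTERVAL) for name, s in salaries.items()}
print(script_keys)  # {'Bob': 6000, 'Jack': 3000, 'Tom': 9000}

# The keys describe shifted values; subtract the shift to get the salary range.
salary_ranges = {k: (k - SHIFT, k - SHIFT + INTERVAL)
                 for k in sorted(set(script_keys.values()))}
print(salary_ranges)  # {3000: (1000, 4000), 6000: (4000, 7000), 9000: (7000, 10000)}

# Same grouping without a script: move the boundaries with an offset instead.
OFFSET = (-SHIFT) % INTERVAL  # 1000
offset_keys = {name: bucket_key(s, INTERVAL, OFFSET) for name, s in salaries.items()}
print(offset_keys)  # {'Bob': 4000, 'Jack': 1000, 'Tom': 7000}
```

Both approaches put the three employees into three separate buckets. The difference is in the keys: with the script they are lower bounds of the shifted values, while with an offset they are the real lower bounds of the salary ranges.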

PUT employee_new
{
  "mappings": {
    "properties": {
      "experience": {
        "type": "keyword"
      }
    }
  }
}

We then copy the data over from the employee index with reindex:

POST _reindex
{
  "source": {
    "index": "employee"
  },
  "dest": {
    "index": "employee_new"
  }
}

The aggregation with the script looks like this:

GET employee_new/_search
{
  "size": 0, 
  "aggs": {
    "urls": {
      "terms": {
        "field": "experience", 
        "script": "_value.substring(0,1)"
      }
    }
  }
}

The aggregation above returns:

  "aggregations" : {
    "urls" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "3",
          "doc_count" : 2
        },
        {
          "key" : "7",
          "doc_count" : 1
        }
      ]
    }
  }

The keys are now 3 and 7, that is, 3 years and 7 years. We can also rewrite the aggregation as:

GET employee_new/_search
{
  "size": 0, 
  "aggs": {
    "urls": {
      "terms": {
        "script": "doc['experience'].get(0).substring(0,1)"
      }
    }
  }
}

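It returns the same result as above.

Using a script to run a terms aggregation over multiple fields

With terms aggregations it is sometimes useful to aggregate over more than one field. Suppose we run a terms aggregation on the about field. By default it only gives us the document counts for the top terms of that field. We may also need another terms aggregation on the position field, which returns the document counts for the top terms of that field. Taking the example further, there are cases where the values of both fields are needed in the same buckets, which calls for a single terms aggregation that covers both fields.

The terms aggregation has no option for this. So let us try a script, which is actually quite simple. Here is how to run a terms aggregation over the "about" and "position" fields: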
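As a quick check of what the substring(0,1) script does to each value, here is a small Python sketch of the same key extraction. The helper names first_char and before_hyphen are invented for this illustration, and the value "12-years" is a made-up input that does not occur in the sample documents; it only shows where taking a single character stops being enough.

```python
from collections import Counter

# The experience values of the three sample documents.
experience = ["3-years", "3-years", "7-years"]

def first_char(value):
    """Mirror of the script's substring(0,1): keep only the first character."""
    return value[0:1]

def before_hyphen(value):
    """Hypothetical alternative: keep everything before the first '-'."""
    return value.split("-", 1)[0]

first_char_buckets = dict(Counter(first_char(v) for v in experience))
print(first_char_buckets)  # {'3': 2, '7': 1}

# For single-digit values both helpers agree ...
assert [first_char(v) for v in experience] == [before_hyphen(v) for v in experience]

# ... but a two-digit value (made up here) is truncated by first_char.
print(first_char("12-years"), before_hyphen("12-years"))  # 1 12
```

So the first-character approach matches the sample data, while cutting at the "-" is the safer choice once an employee has ten or more years of experience.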

GET employee/_search
{
  "size": 0, 
  "aggs": {
    "union_demo": {
      "terms": {
        "size": 30,
        "script": "doc['about.keyword'].value + ' ' + doc['position.keyword'].value"
      }
    }
  }
}

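Note that we set the size parameter to 30. A terms aggregation returns only 10 buckets by default, so a larger size is needed whenever the script can produce more than 10 distinct keys. The script reads the about.keyword and position.keyword sub-fields, which hold each value as a single un-analyzed string, and joins the two values with a space. The result is therefore one key per document that combines both fields, not a list of individual words.

The aggregation above returns: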

  "aggregations" : {
    "union_demo" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Bob joined the company as a full time technology consultant in the year 2012 consultant",
          "doc_count" : 1
        },
        {
          "key" : "Jack joined the company as a part time management consultant in the year 2013 Management consultant",
          "doc_count" : 1
        },
        {
          "key" : "Tom is serving as the operations manager of the firm from the year 2011 Operations manager",
          "doc_count" : 1
        }
      ]
    }
  }

Above, we combined the two fields about and position into a single new value and ran the aggregation on it.
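To see why this produces one bucket per document, the Python sketch below rebuilds the script's keys from the three sample documents. It then contrasts them with a true union of terms, where each field value is counted as a key of its own. The union variant is only a hypothetical comparison for this illustration; it is not what the query above computes.

```python
from collections import Counter

# The about and position values of the three sample documents.
docs = [
    {"about": "Bob joined the company as a full time technology consultant in the year 2012",
     "position": "consultant"},
    {"about": "Jack joined the company as a part time management consultant in the year 2013",
     "position": "Management consultant"},
    {"about": "Tom is serving as the operations manager of the firm from the year 2011",
     "position": "Operations manager"},
]

# What the script does: one concatenated key per document.
concat_buckets = Counter(d["about"] + " " + d["position"] for d in docs)

# Hypothetical alternative: a union, where every field value is its own key.
# A set is used so a document counts once per key even if both fields match.
union_buckets = Counter()
for d in docs:
    union_buckets.update({d["about"], d["position"]})

print(len(concat_buckets), len(union_buckets))  # 3 6
```

With only three documents the concatenation yields three buckets, well below the default limit of 10, so size: 30 matters only for larger data sets.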