Elasticsearch: Searching indexed documents with Java

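This article explains in detail how to run searches against Elasticsearch from Java, including match-all search, range queries, full-text search, compound queries, and highlighting, and demonstrates each technique on a twitter index.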

This article is one of a series.

In this article, I describe in detail how to search an index. Before working through the exercises below, use Kibana to create an index called twitter:

PUT twitter
{
  "mappings": {
    "properties": {
      "DOB": {
        "type": "date"
      },
      "address": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "age": {
        "type": "long"
      },
      "city": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "country": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "location": {
        "type": "geo_point"
      },
      "message": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "province": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "uid": {
        "type": "long"
      },
      "user": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

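Above, we created an index called twitter. If the command is not yet clear to you, see my earlier article “Getting started with Elasticsearch (2)” (开始使用 Elasticsearch (2)). Next, we import the documents with the following command: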

POST twitter/_bulk
{"index":{"_id":1}}
{"user":"双榆树-张三","DOB":"1992-08-03","message":"今儿天气不错啊,出去转转去","uid":1,"age":30,"city":"北京","province":"北京","country":"中国","address":"中国北京市海淀区","location":{"lat":"39.970718","lon":"116.325747"}}
{"index":{"_id":2}}
{"user":"东城区-老刘","DOB":"1990-07-14","message":"出发,下一站云南!","uid":2,"age":32,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区台基厂三条3号","location":{"lat":"39.904313","lon":"116.412754"}}
{"index":{"_id":3}}
{"user":"东城区-李四","DOB":"1997-09-23","message":"happy birthday!","uid":3,"age":25,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区","location":{"lat":"39.893801","lon":"116.408986"}}
{"index":{"_id":4}}
{"user":"朝阳区-老贾","DOB":"1980-06-30","message":"123,gogogo","uid":4,"age":42,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区建国门","location":{"lat":"39.718256","lon":"116.367910"}}
{"index":{"_id":5}}
{"user":"朝阳区-老王","DOB":"1996-06-18","message":"Happy BirthDay My Friend!","uid":5,"age":26,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区国贸","location":{"lat":"39.918256","lon":"116.467910"}}
{"index":{"_id":6}}
{"user":"虹桥-老吴","DOB":"2000-04-05","message":"好友来了都今天我生日,好友来了,什么 birthday happy 就成!","uid":7,"age":22,"city":"上海","province":"上海","country":"中国","address":"中国上海市闵行区","location":{"lat":"31.175927","lon":"121.383328"}}

Note that DOB above stands for date of birth. We can check the number of documents with the following command:

GET twitter/_count

The response reports 6 documents.

Creating a Java application to search the documents

To make the code easier to follow, I have put the final code on GitHub: https://github.com/liu-xiao-guo/ElasticsearchJava-search. You can download it with the following command:

git clone https://github.com/liu-xiao-guo/ElasticsearchJava-search

Creating the Java project

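Following the earlier articles in this series, use your favorite IDE to create a basic Java project; I won't repeat the steps here. Those articles also cover how to create the connection to Elasticsearch. The snippets below rely on a client object (the high-level REST client, judging by the client.search(request, RequestOptions.DEFAULT) calls) and an INDEX_NAME constant, which here refers to the twitter index. In the rest of this article, I explain in detail how to search from code.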

Searching documents

Search 1: search for all documents

We use Java to search for all documents:

        // Search 1: Search for all documents
        System.out.println("****************** Search 1");
        SearchRequest searchRequest = new SearchRequest();
        searchRequest.indices(INDEX_NAME);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.matchAllQuery());
        searchRequest.source(searchSourceBuilder);
        // Holds the _source of the current hit; reused by the later searches
        Map<String, Object> map = null;

        try {
            SearchResponse searchResponse = null;
            searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
            if (searchResponse.getHits().getTotalHits().value > 0) {
                SearchHit[] searchHit = searchResponse.getHits().getHits();
                for (SearchHit hit : searchHit) {
                    map = hit.getSourceAsMap();
                    System.out.println("map:" + Arrays.toString(map.entrySet().toArray()));

                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

Above, we use QueryBuilders.matchAllQuery() to query all documents. This is equivalent to the following Kibana command:

GET twitter/_search

Run the code above. The output is:

****************** Search 1
map:[uid=1, country=中国, address=中国北京市海淀区, province=北京, city=北京, DOB=1992-08-03, location={lon=116.325747, lat=39.970718}, message=今儿天气不错啊,出去转转去, user=双榆树-张三, age=30]
map:[uid=2, country=中国, address=中国北京市东城区台基厂三条3号, province=北京, city=北京, DOB=1990-07-14, location={lon=116.412754, lat=39.904313}, message=出发,下一站云南!, user=东城区-老刘, age=32]
map:[uid=3, country=中国, address=中国北京市东城区, province=北京, city=北京, DOB=1997-09-23, location={lon=116.408986, lat=39.893801}, message=happy birthday!, user=东城区-李四, age=25]
map:[uid=4, country=中国, address=中国北京市朝阳区建国门, province=北京, city=北京, DOB=1980-06-30, location={lon=116.367910, lat=39.718256}, message=123,gogogo, user=朝阳区-老贾, age=42]
map:[uid=5, country=中国, address=中国北京市朝阳区国贸, province=北京, city=北京, DOB=1996-06-18, location={lon=116.467910, lat=39.918256}, message=Happy BirthDay My Friend!, user=朝阳区-老王, age=26]
map:[uid=7, country=中国, address=中国上海市闵行区, province=上海, city=上海, DOB=2000-04-05, location={lon=121.383328, lat=31.175927}, message=好友来了都今天我生日,好友来了,什么 birthday happy 就成!, user=虹桥-老吴, age=22]

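The output shows that the search returned all of the documents. The default page size is 10, so all 6 documents fit in a single response.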

Search 2: search for data in a range

        // Search 2: post-filter on age, 25 to 30 (both bounds inclusive)
        System.out.println("****************** Search 2");
        SearchSourceBuilder builder = new SearchSourceBuilder()
                .postFilter(QueryBuilders.rangeQuery("age").from(25).to(30));

        SearchRequest searchRequest2 = new SearchRequest();
        searchRequest2.indices(INDEX_NAME);
        searchRequest2.searchType(SearchType.DFS_QUERY_THEN_FETCH);
        searchRequest2.source(builder);

        try {
            SearchResponse searchResponse = null;
            searchResponse = client.search(searchRequest2, RequestOptions.DEFAULT);
            if (searchResponse.getHits().getTotalHits().value > 0) {
                SearchHit[] searchHit = searchResponse.getHits().getHits();
                for (SearchHit hit : searchHit) {
                    map = hit.getSourceAsMap();
                    System.out.println("map:" + Arrays.toString(map.entrySet().toArray()));

                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

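Above, we search for all documents whose age is between 25 and 30. By default, from(25).to(30) includes both bounds, which corresponds to gte and lte. The code is similar to the following Kibana search: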

GET twitter/_search
{
  "query": {
    "match_all": {}
  },
  "post_filter": {
    "range": {
      "age": {
        "gte": 25,
        "lte": 30
      }
    }
  }
}

Run the application. The output of search 2 is:

****************** Search 2
map:[uid=1, country=中国, address=中国北京市海淀区, province=北京, city=北京, DOB=1992-08-03, location={lon=116.325747, lat=39.970718}, message=今儿天气不错啊,出去转转去, user=双榆树-张三, age=30]
map:[uid=3, country=中国, address=中国北京市东城区, province=北京, city=北京, DOB=1997-09-23, location={lon=116.408986, lat=39.893801}, message=happy birthday!, user=东城区-李四, age=25]
map:[uid=5, country=中国, address=中国北京市朝阳区国贸, province=北京, city=北京, DOB=1996-06-18, location={lon=116.467910, lat=39.918256}, message=Happy BirthDay My Friend!, user=朝阳区-老王, age=26]

The result shows that 3 documents have an age between 25 and 30, inclusive.
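
To make the inclusive bounds concrete, here is a minimal in-memory sketch. It does not call Elasticsearch; it only applies the same gte/lte check to the ages from the bulk request. The names RangeFilterSketch, idsWithAgeBetween, and sampleAges are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration only; this does not talk to Elasticsearch.
class RangeFilterSketch {

    // Returns the ids whose age satisfies gte <= age <= lte (both bounds inclusive).
    static List<Integer> idsWithAgeBetween(Map<Integer, Integer> ageById, int gte, int lte) {
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : ageById.entrySet()) {
            int age = entry.getValue();
            if (age >= gte && age <= lte) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    // _id -> age, copied from the bulk request in this article.
    static Map<Integer, Integer> sampleAges() {
        Map<Integer, Integer> ages = new LinkedHashMap<>();
        ages.put(1, 30);
        ages.put(2, 32);
        ages.put(3, 25);
        ages.put(4, 42);
        ages.put(5, 26);
        ages.put(6, 22);
        return ages;
    }

    public static void main(String[] args) {
        // Documents 1 (age 30) and 3 (age 25) sit exactly on the bounds and are kept.
        System.out.println(idsWithAgeBetween(sampleAges(), 25, 30)); // prints [1, 3, 5]
    }
}
```

With exclusive bounds (gt/lt), documents 1 and 3 would drop out and only document 5 would remain.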

Search 3: full-text search on a field

        // Search 3: full-text match on the user field, first page of 2 hits
        System.out.println("****************** Search 3");
        SearchSourceBuilder builder3 = new SearchSourceBuilder();
        builder3.from(0);
        builder3.size(2);
        builder3.timeout(new TimeValue(60, TimeUnit.SECONDS));
        builder3.query(QueryBuilders.matchQuery("user", "朝阳"));

        SearchRequest searchRequest3 = new SearchRequest();
        searchRequest3.indices(INDEX_NAME);
        searchRequest3.searchType(SearchType.DFS_QUERY_THEN_FETCH);
        searchRequest3.source(builder3);
        try {
            SearchResponse searchResponse = null;
            searchResponse = client.search(searchRequest3, RequestOptions.DEFAULT);
            if (searchResponse.getHits().getTotalHits().value > 0) {
                SearchHit[] searchHit = searchResponse.getHits().getHits();
                for (SearchHit hit : searchHit) {
                    map = hit.getSourceAsMap();
                    System.out.println("map:" + Arrays.toString(map.entrySet().toArray()));

                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

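We search all documents for those whose user field contains “朝阳” and return the first page of results. Apart from the 60-second timeout and the DFS_QUERY_THEN_FETCH search type that the Java code also sets, this is equivalent to the following Kibana command: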

GET twitter/_search
{
  "from": 0,
  "size": 2,
  "query": {
    "match": {
      "user": "朝阳"
    }
  }
}

Run the code above. The output is:

****************** Search 3
map:[uid=4, country=中国, address=中国北京市朝阳区建国门, province=北京, city=北京, DOB=1980-06-30, location={lon=116.367910, lat=39.718256}, message=123,gogogo, user=朝阳区-老贾, age=42]
map:[uid=5, country=中国, address=中国北京市朝阳区国贸, province=北京, city=北京, DOB=1996-06-18, location={lon=116.467910, lat=39.918256}, message=Happy BirthDay My Friend!, user=朝阳区-老王, age=26]

The results are the documents whose user field contains “朝阳”. Two documents are returned, which matches the page size of 2.
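
The from and size parameters select a window over the ranked hits. Below is a minimal sketch of that windowing, assuming a plain in-memory list rather than a real search response. PagingSketch, page, and fromForPage are hypothetical names, and the hit ids a through e are placeholders.

```java
import java.util.List;

// In-memory illustration of from/size paging; not an Elasticsearch API.
class PagingSketch {

    // Returns the slice [from, from + size) of the ranked hits, clipped to the list bounds.
    static <T> List<T> page(List<T> rankedHits, int from, int size) {
        int start = Math.min(Math.max(from, 0), rankedHits.size());
        int end = (int) Math.min((long) start + Math.max(size, 0), rankedHits.size());
        return rankedHits.subList(start, end);
    }

    // Computes the "from" offset for a zero-based page number.
    static int fromForPage(int pageNumber, int size) {
        return pageNumber * size;
    }

    public static void main(String[] args) {
        List<String> hits = List.of("a", "b", "c", "d", "e"); // placeholder hit ids
        System.out.println(page(hits, fromForPage(0, 2), 2)); // prints [a, b]
        System.out.println(page(hits, fromForPage(1, 2), 2)); // prints [c, d]
        System.out.println(page(hits, fromForPage(2, 2), 2)); // prints [e]
    }
}
```

In the search above, from is 0 and size is 2, so only the first window of hits is returned.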

Search 4: compound query

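Very often, we use a compound query to get the documents we need. For background on compound queries, see my earlier article “Getting started with Elasticsearch (2)”. A compound query generally has the following form: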

POST _search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "user" : "kimchy" }
      },
      "filter": {
        "term" : { "tag" : "tech" }
      },
      "must_not" : {
        "range" : {
          "age" : { "gte" : 10, "lte" : 20 }
        }
      },
      "should" : [
        { "term" : { "tag" : "wow" } },
        { "term" : { "tag" : "elasticsearch" } }
      ],
      "minimum_should_match" : 1,
      "boost" : 1.0
    }
  }
}

It is a bool query composed of must, filter, must_not, and should clauses.

        // Search 4: compound (bool) query with must and should clauses
        System.out.println("****************** Search 4");
        MatchQueryBuilder matchQueryBuilder = new MatchQueryBuilder("user", "朝阳");
        MatchQueryBuilder matchQueryBuilder1 = new MatchQueryBuilder("address", "北京");

        RangeQueryBuilder rangeQueryBuilder = new RangeQueryBuilder("age").from(25).to(30);
        BoolQueryBuilder boolQueryBuilder = new BoolQueryBuilder()
                .must(matchQueryBuilder)
                .must(matchQueryBuilder1)
                .should(rangeQueryBuilder);

        SearchSourceBuilder builder4 = new SearchSourceBuilder().query(boolQueryBuilder);
        builder4.from(0);
        builder4.size(2);
        builder4.timeout(new TimeValue(60, TimeUnit.SECONDS));
        builder4.sort("DOB", SortOrder.ASC);

        SearchRequest searchRequest4 = new SearchRequest();
        searchRequest4.indices(INDEX_NAME);
        searchRequest4.searchType(SearchType.DFS_QUERY_THEN_FETCH);
        searchRequest4.source(builder4);
        try {
            SearchResponse searchResponse = null;
            searchResponse = client.search(searchRequest4, RequestOptions.DEFAULT);
            if (searchResponse.getHits().getTotalHits().value > 0) {
                SearchHit[] searchHit = searchResponse.getHits().getHits();
                for (SearchHit hit : searchHit) {
                    map = hit.getSourceAsMap();
                    System.out.println("map:" + Arrays.toString(map.entrySet().toArray()));

                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

Above, we use a bool query composed of must and should clauses. It is equivalent to the following Kibana command:

GET twitter/_search
{
  "from": 0,
  "size": 2, 
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "user": "朝阳"
          }
        },
        {
          "match": {
            "address": "北京"
          }
        }
      ],
      "should": [
        {
          "range": {
            "age": {
              "gte": 25,
              "lte": 30
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "DOB": {
        "order": "asc"
      }
    }
  ]
}

Running the command above in Kibana returns:

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "twitter",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : null,
        "_source" : {
          "user" : "朝阳区-老贾",
          "DOB" : "1980-06-30",
          "message" : "123,gogogo",
          "uid" : 4,
          "age" : 42,
          "city" : "北京",
          "province" : "北京",
          "country" : "中国",
          "address" : "中国北京市朝阳区建国门",
          "location" : {
            "lat" : "39.718256",
            "lon" : "116.367910"
          }
        },
        "sort" : [
          331171200000
        ]
      },
      {
        "_index" : "twitter",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : null,
        "_source" : {
          "user" : "朝阳区-老王",
          "DOB" : "1996-06-18",
          "message" : "Happy BirthDay My Friend!",
          "uid" : 5,
          "age" : 26,
          "city" : "北京",
          "province" : "北京",
          "country" : "中国",
          "address" : "中国北京市朝阳区国贸",
          "location" : {
            "lat" : "39.918256",
            "lon" : "116.467910"
          }
        },
        "sort" : [
          835056000000
        ]
      }
    ]
  }
}

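We can see that the results are sorted by DOB in ascending order. The value in each hit's sort array is the DOB expressed as milliseconds since the epoch.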
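
As a quick check of the sort values in the response, this small sketch converts the two DOB values to epoch milliseconds at midnight UTC. It uses only java.time; the SortValueSketch class and toEpochMillis method are illustrative names, not part of any Elasticsearch API.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

// Converts an ISO date (yyyy-MM-dd) to milliseconds since the Unix epoch at midnight UTC.
class SortValueSketch {

    static long toEpochMillis(String isoDate) {
        return LocalDate.parse(isoDate)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("1980-06-30")); // prints 331171200000
        System.out.println(toEpochMillis("1996-06-18")); // prints 835056000000
    }
}
```

Both numbers match the sort values of documents 4 and 5 in the response above.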

Run our code:

****************** Search 4
map:[uid=4, country=中国, address=中国北京市朝阳区建国门, province=北京, city=北京, DOB=1980-06-30, location={lon=116.367910, lat=39.718256}, message=123,gogogo, user=朝阳区-老贾, age=42]
map:[uid=5, country=中国, address=中国北京市朝阳区国贸, province=北京, city=北京, DOB=1996-06-18, location={lon=116.467910, lat=39.918256}, message=Happy BirthDay My Friend!, user=朝阳区-老王, age=26]

The results from the Java code are likewise sorted by DOB in ascending order.

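Some readers may ask: why was document 4, whose age is 42, returned? That is how should works. When a bool query also has must clauses, the should clauses are optional: a document that satisfies a should clause gets a higher score, but a document that does not still matches. Because we re-sort the results with sort, the scores play no role here, which is why _score is null in the response.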
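
The must/should behavior can be sketched in memory as follows. This is an illustration under simplifying assumptions, not how Elasticsearch evaluates the query: each document is reduced to a district and a city written in ASCII as stand-ins for the Chinese user and address values, matching is plain string equality instead of analyzed full-text matching, and the should clause contributes a flat bonus of 1. All names (BoolQuerySketch, Doc, matchesMust, shouldBonus) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory illustration of bool query semantics; not an Elasticsearch API.
class BoolQuerySketch {

    static final class Doc {
        final int id;
        final String district; // ASCII stand-in for the district prefix of the user field
        final String city;     // ASCII stand-in for the city found in the address field
        final int age;

        Doc(int id, String district, String city, int age) {
            this.id = id;
            this.district = district;
            this.city = city;
            this.age = age;
        }
    }

    // must: every clause has to match, otherwise the document is excluded.
    static boolean matchesMust(Doc d) {
        return d.district.equals("Chaoyang") && d.city.equals("Beijing");
    }

    // should: optional when must clauses exist; a match only adds to the score.
    static int shouldBonus(Doc d) {
        return (d.age >= 25 && d.age <= 30) ? 1 : 0;
    }

    static List<Doc> sampleDocs() {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc(1, "Shuangyushu", "Beijing", 30));
        docs.add(new Doc(2, "Dongcheng", "Beijing", 32));
        docs.add(new Doc(3, "Dongcheng", "Beijing", 25));
        docs.add(new Doc(4, "Chaoyang", "Beijing", 42));
        docs.add(new Doc(5, "Chaoyang", "Beijing", 26));
        docs.add(new Doc(6, "Hongqiao", "Shanghai", 22));
        return docs;
    }

    static List<Integer> matchingIds(List<Doc> docs) {
        List<Integer> ids = new ArrayList<>();
        for (Doc d : docs) {
            if (matchesMust(d)) {
                ids.add(d.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        for (Doc d : sampleDocs()) {
            if (matchesMust(d)) {
                System.out.println("id=" + d.id + " shouldBonus=" + shouldBonus(d));
            }
        }
        // prints:
        // id=4 shouldBonus=0
        // id=5 shouldBonus=1
    }
}
```

Document 4 fails the should clause but is still returned; it merely misses the score bonus that document 5 receives.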

Search 5: highlight

Very often, we want the search results to include highlighting. How do we do that? See the highlighting section of my earlier article “Getting started with Elasticsearch (2)”.

Suppose we want to implement the following highlight:

GET twitter/_search
{
  "from": 0,
  "size": 2, 
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "user": "朝阳"
          }
        },
        {
          "match": {
            "address": "北京"
          }
        }
      ],
      "should": [
        {
          "range": {
            "age": {
              "gte": 25,
              "lte": 30
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "DOB": {
        "order": "asc"
      }
    }
  ],
  "highlight": { 
    "pre_tags": ["<my_tag>"],
    "post_tags": ["</my_tag>"], 
    "fields": {
      "user": {}
    }
  }
}

As shown above, we customized the highlight tag: my_tag. The search returns:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "twitter",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : null,
        "_source" : {
          "user" : "朝阳区-老贾",
          "DOB" : "1980-06-30",
          "message" : "123,gogogo",
          "uid" : 4,
          "age" : 42,
          "city" : "北京",
          "province" : "北京",
          "country" : "中国",
          "address" : "中国北京市朝阳区建国门",
          "location" : {
            "lat" : "39.718256",
            "lon" : "116.367910"
          }
        },
        "highlight" : {
          "user" : [
            "<my_tag>朝</my_tag><my_tag>阳</my_tag>区-老贾"
          ]
        },
        "sort" : [
          331171200000
        ]
      },
      {
        "_index" : "twitter",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : null,
        "_source" : {
          "user" : "朝阳区-老王",
          "DOB" : "1996-06-18",
          "message" : "Happy BirthDay My Friend!",
          "uid" : 5,
          "age" : 26,
          "city" : "北京",
          "province" : "北京",
          "country" : "中国",
          "address" : "中国北京市朝阳区国贸",
          "location" : {
            "lat" : "39.918256",
            "lon" : "116.467910"
          }
        },
        "highlight" : {
          "user" : [
            "<my_tag>朝</my_tag><my_tag>阳</my_tag>区-老王"
          ]
        },
        "sort" : [
          835056000000
        ]
      }
    ]
  }
}

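As shown above, “朝” and “阳” are tagged separately; this is a result of tokenization. In the highlight section of the response, each of them is wrapped in <my_tag> and </my_tag>. We highlight the user field. So how do we implement this highlight in Java?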

        // Search 5: highlight
        System.out.println("****************** Search 5");
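        // NOTE: the tag arguments are swapped here. preTags() should receive the
        // opening tag and postTags() the closing tag; the output shown below
        // reflects the swapped order.
        HighlightBuilder highlightBuilder = new HighlightBuilder()
                .postTags("<mytag>")
                .preTags("</mytag>")
                .field("user");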

        MatchQueryBuilder matchQueryBuilder3 = new MatchQueryBuilder("user", "朝阳");
        MatchQueryBuilder matchQueryBuilder4 = new MatchQueryBuilder("address", "北京");

        RangeQueryBuilder rangeQueryBuilder5 = new RangeQueryBuilder("age").from(25).to(30);
        BoolQueryBuilder boolQueryBuilder5 = new BoolQueryBuilder()
                .must(matchQueryBuilder3)
                .must(matchQueryBuilder4)
                .should(rangeQueryBuilder5);

        SearchSourceBuilder builder5 = new SearchSourceBuilder().query(boolQueryBuilder5);
        builder5.from(0);
        builder5.size(2);
        builder5.timeout(new TimeValue(60, TimeUnit.SECONDS));
        builder5.sort("DOB", SortOrder.ASC);
        builder5.highlighter(highlightBuilder);

        SearchRequest searchRequest5 = new SearchRequest();
        searchRequest5.indices(INDEX_NAME);
        searchRequest5.searchType(SearchType.DFS_QUERY_THEN_FETCH);
        searchRequest5.source(builder5);
        try {
            SearchResponse searchResponse = null;
            searchResponse = client.search(searchRequest5, RequestOptions.DEFAULT);

            System.out.println(searchResponse);

        } catch (IOException e) {
            e.printStackTrace();
        }

Above, we added the highlight part. Running the code produces:

{
   "took":25,
   "timed_out":false,
   "_shards":{
      "total":1,
      "successful":1,
      "skipped":0,
      "failed":0
   },
   "hits":{
      "total":{
         "value":2,
         "relation":"eq"
      },
      "max_score":null,
      "hits":[
         {
            "_index":"twitter",
            "_type":"_doc",
            "_id":"4",
            "_score":null,
            "_source":{
               "user":"朝阳区-老贾",
               "DOB":"1980-06-30",
               "message":"123,gogogo",
               "uid":4,
               "age":42,
               "city":"北京",
               "province":"北京",
               "country":"中国",
               "address":"中国北京市朝阳区建国门",
               "location":{
                  "lat":"39.718256",
                  "lon":"116.367910"
               }
            },
            "highlight":{
               "user":[
                  "</mytag>朝<mytag></mytag>阳<mytag>区-老贾"
               ]
            },
            "sort":[
               331171200000
            ]
         },
         {
            "_index":"twitter",
            "_type":"_doc",
            "_id":"5",
            "_score":null,
            "_source":{
               "user":"朝阳区-老王",
               "DOB":"1996-06-18",
               "message":"Happy BirthDay My Friend!",
               "uid":5,
               "age":26,
               "city":"北京",
               "province":"北京",
               "country":"中国",
               "address":"中国北京市朝阳区国贸",
               "location":{
                  "lat":"39.918256",
                  "lon":"116.467910"
               }
            },
            "highlight":{
               "user":[
                  "</mytag>朝<mytag></mytag>阳<mytag>区-老王"
               ]
            },
            "sort":[
               835056000000
            ]
         }
      ]
   }
}

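From the output above, we can see that “朝” and “阳” are highlighted separately. The tags come out reversed (</mytag> before each token and <mytag> after it) because the code passes the closing tag to preTags() and the opening tag to postTags().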
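
To see why the order of the two tag arguments matters, here is a minimal string-assembly sketch. It is not the Elasticsearch highlighter: it simply wraps each occurrence of a term in a prefix and a suffix, using the message text of document 3 as sample input. HighlightSketch and highlight are hypothetical names.

```java
import java.util.List;

// Illustrates how a pre tag and a post tag wrap each matched term; not the real highlighter.
class HighlightSketch {

    // Scans the text left to right; wherever a term starts, emits preTag + term + postTag.
    static String highlight(String text, List<String> terms, String preTag, String postTag) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            String hit = null;
            for (String term : terms) {
                if (!term.isEmpty() && text.startsWith(term, i)) {
                    hit = term;
                    break;
                }
            }
            if (hit != null) {
                out.append(preTag).append(hit).append(postTag);
                i += hit.length();
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> terms = List.of("happy", "birthday");
        // Opening tag as the pre tag, closing tag as the post tag: well-formed markup.
        System.out.println(highlight("happy birthday!", terms, "<my_tag>", "</my_tag>"));
        // prints <my_tag>happy</my_tag> <my_tag>birthday</my_tag>!
        // Swapped arguments: each term is preceded by the closing tag.
        System.out.println(highlight("happy birthday!", terms, "</my_tag>", "<my_tag>"));
        // prints </my_tag>happy<my_tag> </my_tag>birthday<my_tag>!
    }
}
```

The second call reproduces the pattern seen in the Java output, where every highlighted token is preceded by the closing tag.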
