Elasticsearch - Query Large Datasets

General

PCI Wallet and LoanPro use Elasticsearch to make certain searches more powerful. However, Elasticsearch returns a maximum of 10,000 results per request. This article describes a strategy for searching datasets larger than that. We have some additional information about building Elasticsearch query objects.

When building an Elasticsearch query, you can use the "from" and "size" keys. The "from" key specifies the record at which the returned results start, and "size" specifies the number of records returned. Because a single request returns at most 10,000 results, a query that matches more than 10,000 records has to be retrieved with additional requests.

To make those additional requests, use the "search_after" key, which specifies the record after which the next set of results should start. Your query must meet a few requirements for this to work. The first is the sort order: the records must be sorted by a unique field, such as "id".

For example, if you want to search PCI Wallet transactions and sort your results, your payload will look something like this:

{
  "search": {
    "query": {
      "bool": {
        "must": [
          {
            "bool": {
              "should": {
                "query_string": {
                  "query": "*",
                  "fields": [
                    "customer_name",
                    "processor",
                    "metadata",
                    "status",
                    "batch",
                    "_id",
                    "id",
                    "amount"
                  ],
                  "default_operator": "AND"
                }
              }
            }
          }
        ]
      }
    },
    "aggregations": {
      "aggs": {
        "terms": {
          "field": "batch.keyword",
          "size": 100
        }
      }
    },
    "sort": [
      {"id": "asc"}
    ],
    "size": 10,
    "from": 0
  }
}

Note that a "sort" array is included in the payload. We are sorting by "id", which is guaranteed to be unique (this is the PCI Wallet transaction ID, not the "_id" assigned by the search backend). If you sort by a different field and that field is not unique, you need to add a unique tiebreaker field to the sort. Otherwise, documents that share the same sort value may be duplicated or skipped between pages. The query above returns each document with an additional "sort" array holding its sort values, like this:

{
  "_index": "1175",
  "_type": "transaction",
  "_id": "the-search-transaction-id",
  "_score": null,
  "_source": {
    // the document body goes here
  },
  "sort": [
    12345
  ]
}
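
If you sort by a non-unique field, the tiebreaker mentioned above can be appended to the "sort" array before the request is sent. The Python sketch below illustrates one way to do that. The helper name with_tiebreaker is hypothetical, and it assumes "id" is the unique field, as in the example payload. With two sort keys, each returned document carries two values in its "sort" array, and "search_after" must be given both values in the same order.

```python
def with_tiebreaker(sort, tiebreaker="id"):
    """Return a copy of a "sort" list that includes a unique tiebreaker field.

    Hypothetical helper: assumes `tiebreaker` names a field that is unique
    per document (the PCI Wallet transaction "id" in this article).
    """
    # A sort clause is either a bare field name or a {field: order} object.
    fields = [c if isinstance(c, str) else next(iter(c)) for c in sort]
    if tiebreaker in fields:
        return list(sort)  # already sorted by the unique field
    return list(sort) + [{tiebreaker: "asc"}]


# "amount" alone is not unique, so "id" is appended as the last sort key.
sort = with_tiebreaker([{"amount": "desc"}])
print(sort)
```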

To retrieve the next page, submit the same search query with a "search_after" key set to the "sort" values of the last document on the current page:

{
  "search": {
    "query": {
      "bool": {
        "must": [
          {
            "bool": {
              "should": {
                "query_string": {
                  "query": "*",
                  "fields": [
                    "customer_name",
                    "processor",
                    "metadata",
                    "status",
                    "batch",
                    "_id",
                    "id",
                    "amount"
                  ],
                  "default_operator": "AND"
                }
              }
            }
          }
        ]
      }
    },
    "aggregations": {
      "aggs": {
        "terms": {
          "field": "batch.keyword",
          "size": 100
        }
      }
    },
    "search_after": [12345],
    "sort": [
      {"id": "asc"}
    ],
    "size": 5,
    "from": 0
  }
}
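
Repeating this step until no documents come back walks through the whole result set. The Python sketch below shows that loop under stated assumptions: send_search is a hypothetical callable that sends the payload to the search endpoint and returns the list of documents for that page (each with its "sort" array), and fake_send_search is an in-memory stand-in used only to make the example runnable. Neither is part of any real API.

```python
import copy


def iter_all_hits(send_search, payload, page_size=1000):
    """Yield every document for a query by paging with "search_after".

    `send_search` is a hypothetical callable: it takes a request payload
    shaped like the examples above and returns that page's documents.
    """
    request = copy.deepcopy(payload)  # leave the caller's payload untouched
    request["search"]["size"] = page_size
    request["search"]["from"] = 0
    request["search"].pop("search_after", None)  # the first page has no cursor

    while True:
        hits = send_search(request)
        if not hits:
            break
        yield from hits
        if len(hits) < page_size:
            break  # a short page means nothing is left
        # Resume after the last document of this page.
        request["search"]["search_after"] = hits[-1]["sort"]


def fake_send_search(request):
    """In-memory stand-in for the real request, for demonstration only."""
    search = request["search"]
    after = search.get("search_after", [0])[0]
    docs = [{"_source": {"id": i}, "sort": [i]} for i in range(1, 26)]
    remaining = [d for d in docs if d["sort"][0] > after]
    return remaining[: search["size"]]


payload = {"search": {"query": {"match_all": {}}, "sort": [{"id": "asc"}]}}
ids = [h["_source"]["id"] for h in iter_all_hits(fake_send_search, payload, page_size=10)]
print(len(ids))  # 25
```

Passing the request function in as an argument keeps the paging logic separate from how the request is actually sent, so the same loop can be reused with whichever client you use.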

For this to work, the query must sort by a unique field as mentioned above, "from" must be 0 or -1, and "size" must be less than 10,000.
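
These requirements can be checked before a request is sent. The following Python sketch is one possible way to do so. The function name check_search_after_request and its parameters are hypothetical, and it assumes "id" is the only field known to be unique. It inspects the object under the "search" key of the payload.

```python
def check_search_after_request(search, unique_fields=("id",), max_size=10_000):
    """Return a list of problems found in a "search_after" request body.

    Hypothetical helper; `unique_fields` lists the fields assumed to be unique.
    """
    problems = []
    # A sort clause is either a bare field name or a {field: order} object.
    sort_fields = [
        c if isinstance(c, str) else next(iter(c)) for c in search.get("sort", [])
    ]
    if not any(field in unique_fields for field in sort_fields):
        problems.append("sort must include a unique field")
    if search.get("from", 0) not in (0, -1):
        problems.append('"from" must be 0 or -1')
    size = search.get("size")
    if size is not None and not 0 < size < max_size:
        problems.append('"size" must be less than 10,000')
    after = search.get("search_after")
    if after is not None and len(after) != len(sort_fields):
        problems.append('"search_after" needs one value per sort field')
    return problems


ok = {"sort": [{"id": "asc"}], "search_after": [12345], "size": 5, "from": 0}
bad = {"sort": [{"amount": "asc"}], "size": 10_000, "from": 20}
print(check_search_after_request(ok))   # []
print(check_search_after_request(bad))  # three problems: sort, "from", "size"
```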

