Searching for Data
Introduction to searching
- There are two ways of searching:
- Request URI
- Search queries are added to the URL
- Uses Apache Lucene’s query syntax
- Only supports relatively simple queries
- Query DSL
- Search queries are defined as JSON within the request body
- More verbose, but supports all features
- Query DSL Example:
GET /products/_search
{
"query": {
"match_all": {}
}
}
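To compare the two styles side by side, here is a minimal sketch that only builds both requests as strings and sends nothing to a cluster. The `name:pasta` query and the two helper function names are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode

def request_uri_search(index, lucene_query):
    """Build a Request URI search: the query travels in the q URL parameter."""
    return f"GET /{index}/_search?" + urlencode({"q": lucene_query})

def query_dsl_search(index, query):
    """Build a Query DSL search: the query travels as JSON in the request body."""
    return f"GET /{index}/_search\n" + json.dumps({"query": query}, indent=2)

# Hypothetical example values
uri_request = request_uri_search("products", "name:pasta")
dsl_request = query_dsl_search("products", {"match_all": {}})

print(uri_request)
print(dsl_request)
```

The URI form stays short but has to be URL encoded, while the JSON body can grow to express any query.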
Introduction to term level queries
- One group of Elasticsearch queries is called term level queries
- Used to search structured data for exact values (filtering)
- E.g. finding products where the brand equals "Nike"
- Term level queries are not analyzed
- The search value is used exactly as is for inverted index lookups
- Can be used with data types such as keyword, numbers, dates, etc.
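Term level queries are case sensitive.
Just don’t use term level queries for text fields! Their values are analyzed (typically lowercased) during indexing, while the search value is not.
E.g. on a text field, a query for “nike” works fine, but “Nike” doesn’t match anything.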
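The case sensitivity issue can be reproduced with a toy inverted index. This is a rough sketch: `analyze` is only a stand-in for the standard analyzer (split on non-word characters, then lowercase), and the two brand documents are made up.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer: tokenize and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def build_index(docs, analyzed):
    """Map each indexed term to the set of document IDs containing it."""
    index = {}
    for doc_id, value in docs.items():
        terms = analyze(value) if analyzed else [value]  # keyword fields keep the value as is
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index

def term_lookup(index, value):
    """Term level lookup: the search value is used exactly as given."""
    return index.get(value, set())

docs = {1: "Nike", 2: "Adidas"}  # hypothetical brand values
text_index = build_index(docs, analyzed=True)      # behaves like a text field
keyword_index = build_index(docs, analyzed=False)  # behaves like a keyword field

print(term_lookup(text_index, "Nike"))     # no match: the index holds "nike"
print(term_lookup(text_index, "nike"))     # matches document 1
print(term_lookup(keyword_index, "Nike"))  # matches document 1
```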
Searching for terms
- One of the most important search queries in Elasticsearch
- Used to query several different data types
- Text values (keyword only!), numbers, dates, booleans, ...
- Case sensitive by default
- A case_insensitive parameter was added in v7.10.0
- Use the terms query to search for multiple terms
GET /products/_search
{
"query": {
"term": {
"tags.keyword": "Vegetable"
}
}
}
- The case_insensitive parameter enables case insensitive searches (it is set inside the field object, next to value)
GET /products/_search
{
"query": {
"term": {
"tags.keyword": {
"value": "Vegetable",
"case_insensitive": true
}
}
}
}
- Search for multiple items
GET /products/_search
{
"query": {
"terms": {
"tags.keyword": ["Soup", "Meat"]
}
}
}
Retrieving documents by IDs
Example:
GET /products/_search
{
"query": {
"ids": {
"values": ["100", "200", "300"]
}
}
}
is equivalent to this in SQL:
SELECT * FROM products WHERE _id IN ('100', '200', '300');
Range Searches
- The range query is used to perform range searches
- E.g. in_stock >= 1 and in_stock <= 5
- E.g. created >= 2020/01/01 and created <= 2020/01/31
Querying Numeric Ranges
- Example: products that are almost sold out
GET /products/_search
{
"query": {
"range": {
"in_stock": {
"gte": 1,
"lte": 5
}
}
}
}
is equivalent to this in SQL:
WHERE in_stock >= 1 AND in_stock <= 5
- Boundaries not included
GET /products/_search
{
"query": {
"range": {
"in_stock": {
"gt": 1,
"lt": 5
}
}
}
}
is equivalent to this in SQL:
WHERE in_stock > 1 AND in_stock < 5
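The difference between the inclusive (gte, lte) and exclusive (gt, lt) bounds can be checked with a small sketch. The `matches_range` helper and the stock levels are hypothetical; the function only mirrors the comparison semantics in plain Python.

```python
def matches_range(value, gt=None, gte=None, lt=None, lte=None):
    """Return True if value satisfies every bound that was supplied."""
    if gt is not None and not value > gt:
        return False
    if gte is not None and not value >= gte:
        return False
    if lt is not None and not value < lt:
        return False
    if lte is not None and not value <= lte:
        return False
    return True

stock_levels = [0, 1, 3, 5, 6]  # made-up in_stock values

inclusive = [v for v in stock_levels if matches_range(v, gte=1, lte=5)]
exclusive = [v for v in stock_levels if matches_range(v, gt=1, lt=5)]

print(inclusive)  # [1, 3, 5]
print(exclusive)  # [3]
```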
Querying Dates with timestamps
- Use the range query to perform range searches
- Specify one or more of the gt, gte, lt, or lte parameters
- Supports both numbers and dates
- Dates are automatically handled for date fields
- Specifying the time is optional, but recommended if possible
- Custom formats are supported through the format parameter
- Time zones are handled with the time_zone parameter (UTC offset)
GET /products/_search
{
"query": {
"range": {
"created": {
"time_zone": "+01:00",
"format": "yyyy/MM/dd HH:mm:ss",
"gte": "2020/01/01 00:00:00",
"lte": "2020/01/31 23:59:59"
}
}
}
}
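The effect of a UTC offset on the range bounds can be sketched as follows. This is an illustration under the assumption that the bounds are converted to UTC before they are compared with the stored dates; the `to_utc` helper is hypothetical and uses Python's own date patterns rather than Elasticsearch's.

```python
from datetime import datetime, timedelta, timezone

def to_utc(local_text, pattern, offset_hours):
    """Parse a local date string and convert it to UTC using a fixed offset."""
    local = datetime.strptime(local_text, pattern)
    local = local.replace(tzinfo=timezone(timedelta(hours=offset_hours)))
    return local.astimezone(timezone.utc)

pattern = "%Y/%m/%d %H:%M:%S"  # Python counterpart of a yyyy/MM/dd HH:mm:ss pattern

lower = to_utc("2020/01/01 00:00:00", pattern, offset_hours=1)
upper = to_utc("2020/01/31 23:59:59", pattern, offset_hours=1)

print(lower.isoformat())  # 2019-12-31T23:00:00+00:00
print(upper.isoformat())  # 2020-01-31T22:59:59+00:00
```

With an offset of +01:00, both bounds move one hour earlier in UTC, so a document created at 23:30 UTC on 2019/12/31 still falls inside the range.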
Prefixes, wildcards, & regular expressions
- Term level queries are used for exact matching
- Query non-analyzed values with queries that are not analyzed
- There are a few exceptions
- Querying by prefix, wildcards, and regular expressions
- Remember to still query keyword fields
- The prefix query matches terms that begin with a prefix
- The wildcard query enables us to use wildcards
- ? to match any single character
- * to match any number of characters (0-N)
- Avoid placing wildcards at the beginning of patterns if at all possible
- Use the "case_insensitive": true parameter to ignore letter casing
Querying by prefix Example
GET /products/_search
{
"query": {
"prefix": {
"name.keyword": {
"value": "Past"
}
}
}
}
Querying by wildcard Example (? matches a single character)
GET /products/_search
{
"query": {
"wildcard": {
"name.keyword": {
"value": "Past?"
}
}
}
}
Querying by wildcard Example (* matches any number of characters)
GET /products/_search
{
"query": {
"wildcard": {
"name.keyword": {
"value": "Bee*"
}
}
}
}
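The matching rules for prefixes and wildcards can be tried out locally. The sketch below is an approximation: it uses Python's `fnmatchcase`, which happens to share the ? and * semantics, and the product names are made up.

```python
from fnmatch import fnmatchcase

names = ["Pasta", "Pastry", "Beef", "Beer", "Bee", "Roast Beef"]  # hypothetical keyword values

def prefix_query(terms, prefix):
    """Keep the terms that begin with the given prefix (case sensitive)."""
    return [t for t in terms if t.startswith(prefix)]

def wildcard_query(terms, pattern):
    """Keep the terms that match the whole pattern (case sensitive).

    fnmatchcase also understands [seq] character sets, which are not part of
    the wildcard query, so avoid [ and ] in patterns passed to this sketch.
    """
    return [t for t in terms if fnmatchcase(t, pattern)]

print(prefix_query(names, "Past"))     # ['Pasta', 'Pastry']
print(wildcard_query(names, "Past?"))  # ['Pasta']
print(wildcard_query(names, "Bee*"))   # ['Beef', 'Beer', 'Bee']
```

Note that "Roast Beef" is never returned: on a keyword field the whole value is a single term, and it does not start with the pattern.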
Regular Expressions
- The regexp query matches terms that match a regular expression
- Regular expressions are patterns used for matching strings
- Allows more complex queries than the wildcard query
- The whole term must be matched
- Uses Apache Lucene regex engine (^ and $ anchors not supported)
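The "whole term must be matched" rule behaves like `re.fullmatch` in Python. This is only a sketch: Python's regex syntax is not identical to Lucene's, so the patterns below stick to constructs that both are assumed to share (groups, alternation, `.` and `*`), and the terms are made up.

```python
import re

def regexp_query(terms, pattern):
    """Keep the terms that the pattern matches in full, as if it were anchored."""
    return [t for t in terms if re.fullmatch(pattern, t)]

terms = ["Beef", "Beer", "Beefsteak", "Bee"]  # hypothetical keyword values

print(regexp_query(terms, "Bee(f|r)"))  # ['Beef', 'Beer']
print(regexp_query(terms, "Bee.*"))     # ['Beef', 'Beer', 'Beefsteak', 'Bee']
```

"Beefsteak" is rejected by the first pattern even though it contains "Beef", because a partial match is not enough.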
Querying by field existence
- The exists query matches fields that have an indexed value
- Field values are only indexed if they are considered non-empty
GET /products/_search
{
"query": {
"exists": {
"field": "tags.keyword"
}
}
}
Reasons for no indexed value
- Empty value provided (NULL or [])
- The null_value parameter is an exception for NULL values
- No value was provided for the field
- The index mapping parameter is set to false for the field
- The value’s length is greater than the ignore_above parameter
- Malformed value with the ignore_malformed mapping parameter set to true
Inverting the query
- The exists query can be inverted by using the bool query’s must_not occurrence type
Example:
GET /products/_search
{
"query": {
"bool": {
"must_not": [
{
"exists": {
"field": "tags.keyword"
}
}
]
}
}
}
is equivalent to this in SQL:
SELECT * FROM products WHERE tags IS NULL
Introduction to full text queries
- Term level queries are used for exact matching on structured data
- Full text queries are used for searching unstructured text data
- E.g. website content, news articles, emails, chats, transcripts, etc.
- Often used for long texts
- We don’t know which values a field may contain (hence “unstructured”)
- Full text queries are not used for exact matching
- They match values that include a term, often being one of many
Full text queries are analyzed with the field mapping's analyzer.
The resulting terms are used for lookups within the inverted index.
Full text queries vs term level queries
- The main difference is that full text queries are analyzed
- Term level queries aren’t and are therefore used for exact matching
- Don’t use full text queries on keyword fields because the field values were not analyzed during indexing
- That compares analyzed values with non-analyzed values
The match query
- The match query is a fundamental query in Elasticsearch
- The most widely used full text query
- Powerful & flexible when using advanced parameters
- Supports most data types (e.g. dates and numbers)
- Recommendation: Use term level queries if you know the input value
- If the analyzer outputs multiple terms, at least one must match by default
- This can be changed by setting the operator parameter to "and"
- Matches documents that contain one or more of the specified terms
- The search term is analyzed and the result is looked up in the field’s inverted index
GET /products/_search
{
"query": {
"match": {
"name": "PASTA CHICKEN"
}
}
}
"PASTA CHICKEN" --ANALYZER--> ["pasta", "chicken"] --> "pasta" OR "chicken"
- Example: require both pasta AND chicken to match
GET /products/_search
{
"query": {
"match": {
"name": {
"query": "PASTA CHICKEN",
"operator": "AND"
}
}
}
}
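The effect of the operator parameter can be sketched without a cluster. In this illustration `analyze` is a rough stand-in for the standard analyzer, and the product names are invented; it only models whether a document matches, not how it is scored.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer: tokenize and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def match(field_value, query_text, operator="or"):
    """Analyze both sides, then require any ("or") or all ("and") query terms."""
    field_terms = set(analyze(field_value))
    hits = [term in field_terms for term in analyze(query_text)]
    return all(hits) if operator == "and" else any(hits)

print(match("Pasta with Tomato Sauce", "PASTA CHICKEN"))                  # True
print(match("Pasta with Tomato Sauce", "PASTA CHICKEN", operator="and"))  # False
print(match("Chicken Pasta Salad", "PASTA CHICKEN", operator="and"))      # True
```

Term order does not matter here, which is the main difference from phrase searches.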
Introduction to relevance scoring
- Query results are sorted in descending order by the _score metadata field
- A floating point number indicating how well a document matches a query
- Documents matching term level queries are generally scored 1.0
- Either a document matches, or it doesn’t (simply filtered out)
- Full text queries are not for exact matching
- How well a document matches is now a factor
- The most relevant results are placed highest (e.g. like on Google)
Searching Multiple Fields
- Multi-match Query
- The multi_match query performs full text searches on multiple fields
- A document matches if at least one field is matched
- Individual fields can be relevance boosted by modifying the field name (^)
- Internally, Elasticsearch rewrites the query to simplify things for us
- By default, the best matching field’s relevance score is used for the document
- Can be configured with the type parameter
GET /products/_search
{
"query": {
"multi_match": {
"query": "vegetable",
"fields": ["name", "tags"]
}
}
}
- Relevance boosting a field: matches in name count twice as much
GET /products/_search
{
"query": {
"multi_match": {
"query": "vegetable",
"fields": ["name^2", "tags"]
}
}
}
Specifying a tie breaker
- By default, one field is used for calculating a document’s relevance score
- We can “reward” documents where multiple fields match with the tie_breaker parameter
- Each matching field affects the relevance score
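How the tie breaker changes the final score can be written out as arithmetic. The sketch assumes the documented behavior of taking the best field's score and adding the other matching fields' scores multiplied by tie_breaker; the per-field scores are invented numbers, and `combined_score` is a hypothetical helper.

```python
def combined_score(field_scores, tie_breaker=0.0):
    """Best field's score plus tie_breaker times the other matching fields' scores."""
    if not field_scores:
        return 0.0
    ordered = sorted(field_scores, reverse=True)
    return ordered[0] + tie_breaker * sum(ordered[1:])

# Made-up scores for a document matching in both name (2.0) and tags (1.0)
print(combined_score([2.0, 1.0]))                   # 2.0, only the best field counts
print(combined_score([2.0, 1.0], tie_breaker=0.5))  # 2.5, the second field is rewarded
```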
Phrase searches
- The match_phrase query is similar to the match query in some ways
- For the match_phrase query, the position (and thereby order) of terms matters
- Terms must appear in the correct order and with no other terms in-between
- The standard analyzer’s tokenizer outputs term positions that are stored within the field’s inverted index
- These positions are then used for phrase searches (among others)
GET /products/_search
{
"query": {
"match_phrase": {
"description": "Elasticsearch guide"
}
}
}
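The role of term positions can be illustrated with a toy phrase matcher. It is a simplified sketch: `analyze` stands in for the standard analyzer, the list index of each token plays the role of its position, and the descriptions are invented.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer; list indexes act as term positions."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def match_phrase(field_value, phrase):
    """True if the phrase terms appear consecutively and in order in the field."""
    field_terms = analyze(field_value)
    phrase_terms = analyze(phrase)
    n = len(phrase_terms)
    if n == 0:
        return False
    return any(
        field_terms[i:i + n] == phrase_terms
        for i in range(len(field_terms) - n + 1)
    )

print(match_phrase("The complete Elasticsearch guide", "Elasticsearch guide"))     # True
print(match_phrase("A guide to Elasticsearch", "Elasticsearch guide"))             # False, wrong order
print(match_phrase("Elasticsearch: the definitive guide", "Elasticsearch guide"))  # False, terms in between
```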
Leaf and compound queries
- Leaf queries search for values and are independent queries
- e.g. the term and match queries
- Compound queries wrap other queries to produce a result
Querying with boolean logic
- Boolean query
- match queries are usually translated into bool queries internally
GET /products/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"tags.keyword": "Alcohol"
}
}
],
"must_not": [
{
"term": {
"tags.keyword": "Wine"
}
}
],
"should": [
{
"term": {
"tags.keyword": "Beer"
}
},
{
"match": {
"name": "beer"
}
},
{
"match": {
"description": "beer"
}
}
]
}
}
}
The must occurrence type
- Query clauses are required to match and will contribute to relevance scores
The must_not occurrence type
- Query clauses must not match and do not affect relevance scoring.
- Query clauses may therefore be cached for improved performance.
Important things about should
- If a bool query only contains should clauses, at least one must match
- Useful if you just want something to match and reward matching documents
- If nothing were required to match, we would get irrelevant results
- If a query clause exists for must or filter, no should clause is required to match
- Any should clauses are only used to boost relevance scores
- The minimum_should_match parameter overrides this behavior: it sets how many should clauses must match, in addition to any must clauses
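The interplay between the occurrence types can be modelled with booleans. The sketch below only decides whether a document matches and ignores scoring. It assumes the default documented for minimum_should_match: 1 when there are should clauses but no must or filter clauses, otherwise 0. The `bool_matches` helper is hypothetical.

```python
def bool_matches(must=(), filters=(), must_not=(), should=(), minimum_should_match=None):
    """Each argument is a sequence of booleans: did that clause match the document?"""
    if minimum_should_match is None:
        # Assumed default: 1 if only should clauses could make the document match, else 0
        minimum_should_match = 1 if should and not (must or filters) else 0
    return (
        all(must)
        and all(filters)
        and not any(must_not)
        and sum(should) >= minimum_should_match
    )

print(bool_matches(should=[False, False]))                # False, at least one should is needed
print(bool_matches(should=[True, False]))                 # True
print(bool_matches(must=[True], should=[False, False]))   # True, should only affects scoring
print(bool_matches(must=[True], should=[False, False],
                   minimum_should_match=1))               # False, one should is now required
```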
The filter occurrence type
- Query clauses must match
- Similar to the must occurrence type
- Ignores relevance scores
- This improves the performance of the query
- Query results can be cached and reused
Query execution contexts
- The query execution context answers two questions:
- "Does this document match?" (yes/no)
- "How well does this document match?" (_score metadata field)
- Query results are sorted by _score in descending order
- The most relevant documents appear at the top
- The query execution context calculates relevance scores
Filter execution context
- Only answers one question: "Does this document match?" (yes/no)
- No relevance scores are calculated
- Used to filter data, typically on structured data (dates, numbers, keyword)
- Relevance scoring is irrelevant if we just want to filter out documents
- Improves performance
- No resources are spent calculating relevance scores
- Query results can be cached
Changing the execution context
- It’s sometimes possible to change the execution context
- Only a few queries support it, though
- Typically done with the bool query and filter aggregation
- Queries that support this generally have a filter parameter
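As an illustration of switching contexts with the bool query, the sketch below wraps an existing clause in the filter occurrence type and prints the resulting request body. The `as_filter` helper is hypothetical, and nothing is sent to a cluster.

```python
import json

def as_filter(query_clause):
    """Wrap a query clause in bool.filter so it runs in the filter execution context."""
    return {"bool": {"filter": [query_clause]}}

# The same term clause used earlier in these notes
clause = {"term": {"tags.keyword": "Vegetable"}}
body = {"query": as_filter(clause)}

print(json.dumps(body, indent=2))
```

The clause itself is unchanged; only its position inside the bool query decides whether relevance scores are calculated for it.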
boosting Query
- The bool query enables us to increase relevance scores with should
- The boosting query can decrease relevance scores with negative
- Documents must match the positive query clause
- Documents that match the negative query clause have their relevance scores decreased
- Use match_all query for positive if you don’t want to filter documents
- Can be used with any query (including compound queries, such as bool)
GET /products/_search
{
"size": 20,
"query": {
"boosting": {
"positive": {
"match": {
"name": "juice"
}
},
"negative": {
"match": {
"name": "apple"
}
},
"negative_boost": 0.5
}
}
}
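The effect of negative_boost can be written out as arithmetic. The sketch assumes the documented behavior of multiplying the positive query's score by negative_boost when the negative clause also matches; the scores are invented, and `boosting_score` is a hypothetical helper.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Scale the positive score down only when the negative clause also matches."""
    if matches_negative:
        return positive_score * negative_boost
    return positive_score

# Made-up scores for two products that both match "juice"
print(boosting_score(2.0, matches_negative=True, negative_boost=0.5))   # "Apple Juice": 1.0
print(boosting_score(2.0, matches_negative=False, negative_boost=0.5))  # "Orange Juice": 2.0
```

Both documents stay in the results; the one matching "apple" is merely ranked lower.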
Disjunction max (dis_max)
- The dis_max query is a compound query
- A document matches if at least one leaf query matches
- The best matching query clause’s relevance score is used for a document’s _score
- tie_breaker can be used to “reward” documents that match multiple queries
- multi_match queries are often translated into dis_max queries internally
dis_max query
- The best matching field’s relevance score is used
GET /products/_search
{
"query": {
"dis_max": {
"queries": [
{
"match": {
"name": "vegetable"
}
},
{
"match": {
"tags": "vegetable"
}
}
]
}
}
}
Querying nested objects
- Problem:
- When indexing arrays of objects, the relationships between values are not maintained
- Queries can yield “unpredictable” results
- Solution:
- Use the nested data type if you want to query objects independently
- Otherwise the relationships between object values are not maintained
- Each nested object is indexed as a hidden Lucene document
- Use the nested query on fields with the nested data type
- Elasticsearch then handles everything automatically
- Create a new index to update the field mapping & reindex documents
- Nested Example:
GET /recipes/_search
{
"query": {
"nested": {
"path": "ingredients",
"query": {
"bool":{
"must": [
{
"match": {
"ingredients.name": "parmesan"
}
},
{
"range": {
"ingredients.amount": {
"gte": 100
}
}
}
]
}
}
}
}
}
For the example above, suppose we have a list of recipes, each with its own ingredients.
Without the nested data type, the query could also match recipes that contain parmesan together with any other ingredient whose amount is at least 100.
With the nested query, it only matches recipes that contain parmesan in an amount of at least 100.
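The difference can be reproduced with plain Python data. This is a simplified sketch with an invented recipe: `matches_flattened` mimics how an array of objects loses the link between name and amount, while `matches_nested` checks each ingredient object on its own.

```python
# Hypothetical recipe: parmesan is present, but only basil reaches an amount of 100
recipe = {
    "title": "Pesto",
    "ingredients": [
        {"name": "parmesan", "amount": 50},
        {"name": "basil", "amount": 150},
    ],
}

def matches_flattened(doc):
    """Object arrays are flattened per field, so values from different objects can combine."""
    names = [i["name"] for i in doc["ingredients"]]
    amounts = [i["amount"] for i in doc["ingredients"]]
    return "parmesan" in names and any(a >= 100 for a in amounts)

def matches_nested(doc):
    """Nested objects are evaluated one at a time, so both conditions must hold together."""
    return any(
        i["name"] == "parmesan" and i["amount"] >= 100
        for i in doc["ingredients"]
    )

print(matches_flattened(recipe))  # True, a false positive
print(matches_nested(recipe))     # False
```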
How documents are stored
- Matching child objects affect the parent document's relevance score
- Elasticsearch calculates a relevance score for each matching child object
- This is because each nested object is a Lucene document
- Use the score_mode parameter to adjust relevance scoring
Nested inner hits
- With the nested query, matches are “root documents”
- E.g. recipes when searching for ingredients
- Sometimes we might want to know what matched, not just that something matched
- Nested inner hits tell us which nested object(s) matched the query
- E.g. which ingredient(s) matched in a recipe
- Without inner hits, we only know that something matched
- Simply add the inner_hits parameter to the nested query
- Supply {} as the value for the default behavior
- Information about the matched nested object(s) is added to search results
- Use the offset key to find each object's position within _source
- Customize results with the name and size parameters
- Example with inner_hits
GET /recipes/_search
{
"query": {
"nested": {
"path": "ingredients",
"inner_hits": {
"name": "my_hits",
"size": 10
},
"query": {
"bool":{
"must": [
{
"match": {
"ingredients.name": "parmesan"
}
},
{
"range": {
"ingredients.amount": {
"gte": 100
}
}
}
]
}
}
}
}
}
inner_hits accepts, among others, these two parameters:
- name enables us to change the key that appears directly within inner_hits
- size enables us to configure how many inner hits are returned for each matching document
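What inner hits add to a result can be sketched in plain Python. This is an approximation with invented data: it returns the offset of each matching nested object and caps the list with size, but keeps source order instead of ordering by relevance. The default of 3 is an assumption about the default size, and `nested_inner_hits` is a hypothetical helper.

```python
def nested_inner_hits(nested_objects, predicate, size=3):
    """Return (offset, object) pairs for matching nested objects, at most size of them."""
    hits = [
        (offset, obj)
        for offset, obj in enumerate(nested_objects)
        if predicate(obj)
    ]
    return hits[:size]

# Hypothetical ingredients of a single recipe
ingredients = [
    {"name": "spaghetti", "amount": 200},
    {"name": "parmesan", "amount": 120},
    {"name": "parmesan", "amount": 30},
]

def is_match(ingredient):
    """Same conditions as the nested query above: parmesan with an amount of at least 100."""
    return ingredient["name"] == "parmesan" and ingredient["amount"] >= 100

print(nested_inner_hits(ingredients, is_match))
# [(1, {'name': 'parmesan', 'amount': 120})]
```

The offset 1 points back at the second entry of the ingredients array in the document's source.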
Nested fields limitations
Performance
- Indexing and querying nested fields is more expensive than for other data types
- Keep performance in mind when using nested fields
- If you map documents well, you should be all good, though
- Denormalizing data is often a good idea
- Elasticsearch has a few settings to prevent things from going wrong
- An Apache Lucene document is created for each nested object
- Increased storage & query costs
- Important to remember for large datasets
- Elasticsearch provides safeguards to reduce the risk of performance bottlenecks
Limitations
- We need to use a specialized data type (nested) and query (nested)
- Max 50 nested fields per index
- Can be increased with the index.mapping.nested_fields.limit setting (not recommended)
- 10,000 nested objects per document (across all nested fields)
- Protects against out of memory (OOM) exceptions
- Can be increased with the index.mapping.nested_objects.limit setting (not recommended)