Mapping and Analysis
Intro to Analysis
- Aka text analysis
- Applies to text fields / values
- Text values are analyzed when indexing documents
- The result is stored in data structures that are efficient for searching
- The `_source` object is not used when searching for documents
  - It contains the exact values specified when indexing a document
- An Analyzer is made of:
- Character Filters: receives the original text and adds, removes, or changes characters.
- Zero or more character filters can be present
- Character filters are applied in the order in which they are specified
- e.g. remove html tags and only keep the text
- Tokenizer: tokenizes a string, meaning split the text into tokens
- There can only be one tokenizer
- Characters could be removed such as punctuation
- e.g. Split a sentence word by word and remove punctuation and white spaces
- Token Filters: receive the output of the tokenizer as input (the tokens)
- could add, remove, or modify tokens
- zero or more token filters can be present
- Token filters are applied in the order in which they are specified
- e.g. lowercase all the tokens
Using the Analyze API
- Open up localhost Kibana Console
- Compare this query:
POST /_analyze
{
"text": "1 random sentence in the air, but then the... PUPPIES :0"
"analyzer": "standard"
}
to this query:
POST /_analyze
{
"text": "1 random sentence in the air, but then the... PUPPIES :0"
"char_filter": [],
"tokenizer": "standard",
"filter": ["lowercase"]
}
These two queries have the same output!
An analyzer is broken up into three parts: char_filter, tokenizer, and filter.
Understanding inverted indices
- A field's values are stored in one of several data structures
- The data structure depends on the field's data type
- Ensures efficient data access (e.g. searches)
- Handled by Apache Lucene, not Elasticsearch
- Inverted Indices: mapping between terms and which documents contain them
- terms are the tokens emitted by the analyzer
- terms are sorted alphabetically for performance reasons
- Inverted indices enable fast searches
- Inverted indices contain more than just terms and document IDs
- information for relevance scoring
- One inverted index per text field
- Other data types like numeric, date, etc use BKD trees
Introduction to mapping
- Mapping defines the structure of documents (fields and their data types)
- used to configure how values are indexed
- Analogy --> a table schema in a relational database
- Explicit Mapping define field mappings ourselves
- Dynamic mapping: elasticsearch generates field mappings for us
Overview of data types
Object data type
- used for any JSON object
- objects may be nested
- Mapped using the `properties` parameter
- Objects are not stored as objects in Apache Lucene
- Objects are transformed to ensure that we can index any valid JSON
Nested data type
- Similar to the `object` data type, but maintains object relationships
  - Useful when indexing arrays of objects
- Enables us to query objects independently
  - Must use the `nested` query
- `nested` objects are stored as hidden documents
Keyword data type
- Used for exact matching of values
- Used for filtering, aggregations, and sorting
  - e.g. searching for articles with a status of `PUBLISHED`
- For full-text searches, use the `text` data type instead
  - e.g. searching the body text of an article
How the keyword data type works
- `keyword` fields are analyzed with the `keyword` analyzer
- The `keyword` analyzer is a no-op analyzer
  - It outputs the unmodified string as a single token
  - This token is placed into the inverted index
- `keyword` fields are used for exact matching, aggregations, and sorting
- Example:
POST /_analyze
{
"text": "1 random sentence in the air, but then the... PUPPIES :0"
"analyzer": "keyword"
}
The output will contain a single token with the text string completely untouched.
Understanding type coercion
- Data types are inspected when indexing documents
- They are validated, and some invalid values are rejected
  - e.g. trying to index an object for a `text` field
- Sometimes, providing the wrong data type is okay
PUT /coercion_test/_doc/1
{
"price": 7.4
}
PUT /coercion_test/_doc/1
{
"price": "7.4"
}
PUT /coercion_test/_doc/1
{
"price": "7.4m"
}
For the second PUT, the string "7.4" does not match the `float` type in the mapping. Elasticsearch will coerce the string into a float, as long as the string contains only a valid number.
For the third PUT, indexing fails because the string "7.4m" contains the letter m, so it cannot be converted to a float.
Understanding the _source object
- Contains the values that were supplied at index time
  - e.g. contains "7.4" and not the indexed value (7.4)
- Search queries use indexed values, not `_source`
  - BKD trees, inverted indices, etc.
- `_source` does not reflect how values are indexed
- Keep coercion in mind if you use values from `_source`
More on Coercion
- Supplying a floating point for an `integer` field will truncate it to an integer
- Coercion is not used for dynamic mapping
- Supplying "7.4" for a new field will create a text mapping
- Always try to use the correct data type
- Especially the first time you index a field
- Coercion is enabled by default
- Could disable it
Understanding arrays
- There is no such thing as an array data type
- Any field may contain zero or more values
- No configuration or mapping needed
- Simply supply an array when indexing a document
- Constraints:
- Array values should be of the same data type
- Coercion only works for fields that are already mapped
- If creating a field mapping with dynamic mapping, an array must contain the same data type.
Array Example:
POST /_analyze
{
"text": ["Strings are simply", "merged together."],
"analyzer": "standard"
}
Looking at the output, the multiple strings are treated as a single string and not as multiple values.
Nested Arrays
- Arrays may contain nested arrays
- Arrays are flattened during indexing
  - `[1, [2, 3]]` becomes `[1, 2, 3]`
Remember to use the nested data type for arrays of objects if you need to query the objects independently.
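As a quick sketch of the flattening behavior (the `array_test` index name is made up):
POST /array_test/_doc/1
{
"values": [1, [2, 3]]
}
The values are indexed as [1, 2, 3], while the `_source` object still contains the original nested array.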
Adding explicit mappings
PUT /reviews
{
"mappings": {
"properties": {
"rating": { "type": "float" },
"content": { "type": "text" },
"product_id": { "type": "integer" },
"author": {
"properties": {
"first_name": { "type": "text" },
"last_name": { "type": "text" },
"email": { "type": "keyword" }
}
}
}
}
}
Retrieving Mappings
GET /reviews/_mapping
GET /reviews/_mapping/field/content
GET /reviews/_mapping/field/author.email
Using dot notation in field names
PUT /reviews
{
"mappings": {
//...
"author":{
"properties": {
"first_name": { "type": "text" },
"last_name": { "type": "text" },
"email": { "type": "keyword" }
}
}
}
}
can be converted to dot notation:
PUT /reviews
{
"mappings": {
//...
"author.first_name": { "type": "text" },
"author.last_name": { "type": "text" },
"author.email": { "type": "keyword" }
}
}
Adding mappings to existing indices
Example: adding a mapping for a `created_at` timestamp to an existing index:
PUT /reviews/_mapping
{
"properties": {
"created_at": {
"type": "date"
}
}
}
How dates work in Elasticsearch
- Specified in one of three ways:
  - Specially formatted strings
  - Milliseconds since the epoch (`long`)
  - Seconds since the epoch (`integer`)
- Epoch refers to the 1st of January 1970
- Custom formats are supported
Default behavior of date fields
- 3 formats:
- A date without time
- A date with time
- ms since the epoch
- UTC timezone assumed if none is specified
- Dates must be formatted according to the ISO 8601 spec
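As a sketch, all three default formats are valid values for the `created_at` field mapped above (the document IDs are arbitrary):
PUT /reviews/_doc/2
{
"created_at": "2015-04-15"
}
PUT /reviews/_doc/3
{
"created_at": "2015-04-15T13:07:41Z"
}
PUT /reviews/_doc/4
{
"created_at": 1436011284000
}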
How date fields are stored
- Stored internally as ms since the epoch
- Any valid value that you supply at index time is converted to a long value internally
- Dates are converted to UTC timezone
- The same date conversion happens for search queries, too
Don't provide UNIX timestamps for default date fields
How missing fields are handled
- All fields in Elasticsearch are optional
- You can leave out a field when indexing documents
- Some integrity checks need to be done at the application level
- e.g. have required fields
- Adding a field mapping does not make a field required
- Searches automatically handle missing fields
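Because all fields are optional, checking which documents actually contain a value for a field is done with the exists query; a sketch against the reviews index:
GET /reviews/_search
{
"query": {
"exists": {
"field": "created_at"
}
}
}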
Overview of mapping parameters
format parameter
- Used to customize the format for `date` fields
  - Recommended to use the default format: `strict_date_optional_time||epoch_millis`
- Using Java's `DateFormatter` syntax:
  - e.g. `"dd/MM/yyyy"`
- Using built-in formats:
  - e.g. `"epoch_second"`
properties parameter
- Defines nested fields for `object` and `nested` fields
Example:
PUT /sales
{
"mappings": {
"properties": {
"sold_by": {
"properties": {
"name": { "type": "text" }
}
}
}
}
}
coerce parameter
- Used to enable or disable coercion of values (enabled by default)
Example:
PUT /sales
{
"mappings": {
"properties": {
"amount": {
"type": "float",
"coerce": false
}
}
}
}
Example of disabling coercion at the index level, to avoid tediously disabling it for every field:
PUT /sales
{
"settings": {
"index.mapping.coerce": false
},
"mappings": {
"properties": {
"amount": {
"type": "float",
"coerce": true
}
}
}
}
The "amount" field overwrites the index level coerce value of false.
Introduction to doc_values
- Elasticsearch makes use of several data structures
- No single data structure serves all purposes
- Inverted indices are excellent for searching text
- They don't perform well for many other data access patterns
- "Doc Values" is another data structure used by Apache Lucene
- Optimized for a different data access pattern (document --> terms)
- `doc_values` is an "uninverted" inverted index
  - Used for sorting, aggregations, and scripting
  - Can be used alongside inverted indices
- Elasticsearch automatically queries the appropriate data structure
Disabling doc_values
- Set the `doc_values` parameter to `false` to save disk space
  - Slightly increases the indexing throughput
- Only disable doc values if you won't use aggregations, sorting, scripting
- Particularly useful for large indices; typically not worth it for small ones
- Cannot be changed without reindexing documents into a new index
- Use with caution and try to anticipate how fields will be queried
Example:
PUT /sales
{
"mappings": {
"properties": {
"buyer_email": {
"type": "keyword",
"doc_values": false
}
}
}
}
norms parameter
- Normalization factors used for relevance scoring
  - Norms refers to the storage of various normalization factors that are used to compute relevance scores
  - We often don't just want to filter results, but also rank them
- Norms can be disabled to save disk space
- Disable norms for fields that won't be used for relevance scoring
Example:
PUT /products
{
"mappings": {
"properties": {
"tags": {
"type": "text",
"norms": false
}
}
}
}
index parameter
- Disables indexing for a field
  - Values are still stored within `_source`
- Useful if you won't use a field for search queries
- Saves disk space and slightly improves indexing throughput
- Often used for time series data
- Fields with indexing disabled can still be used for aggregations
Example:
PUT /server-metrics
{
"mappings": {
"properties": {
"server_id": {
"type": "integer",
"index": false
}
}
}
}
null_value parameter
- `NULL` values cannot be indexed or searched
- Use this parameter to replace `NULL` values with another value
- Only works for explicit `NULL` values
- The replacement value must be of the same data type as the field
- Does not affect the value stored within `_source`
PUT /sales
{
"mappings": {
"properties": {
"partner_id": {
"type": "keyword",
"null_value": "NULL"
}
}
}
}
copy_to parameter
- Used to copy multiple field values into a "group field"
- Simply specify the name of the target field as the value
  - Example: `first_name` and `last_name` --> `full_name`
- Values are copied, not terms/tokens
  - The analyzer of the target field is used for the values
- The target field is not part of `_source`
Example:
PUT /sales
{
"mappings": {
"properties": {
"first_name": {
"type": "text",
"copy_to": "full_name"
},
"last_name": {
"type": "text",
"copy_to": "full_name"
},
"full_name": {
"type": "text"
}
}
}
}
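A sketch of the mapping in use (the document values and the query are made up): both names can be matched through the full_name field, even though it never appears in _source:
PUT /sales/_doc/1
{
"first_name": "John",
"last_name": "Doe"
}
GET /sales/_search
{
"query": {
"match": {
"full_name": "John Doe"
}
}
}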
Updating existing mappings
- Suppose that product IDs may now include letters
- We need to change the `product_id` field's data type to either `text` or `keyword`
  - We won't use the field for full-text searches
  - We will use it for filtering, so the `keyword` data type is ideal
Limitations for updating mappings
- Elasticsearch field mappings cannot be changed.
- Could add new field mappings
- A few mapping parameters could be updated for existing mappings
- Being able to update mappings would be problematic for existing documents
- Text values have already been analyzed, for instance
- Changing between some data types would require rebuilding the whole data structure
- Even for an empty index, we cannot update a mapping
- Field mappings cannot be removed
- Just leave out the field when indexing documents
- The Update By Query API can be used to reclaim disk space
- The solution is to reindex documents into a new index
Reindexing documents with the Reindex API
- The Reindex API copies documents from one index to another, so that we don't have to do it ourselves
POST /_reindex
{
"source": {
"index": "reviews"
},
"dest": {
"index": "reviews_new"
}
}
_source data types
- The data type doesn't reflect how the values are indexed
- `_source` contains the field values supplied at index time
- It's common to use `_source` values from search results
  - e.g. expect a string for a `keyword` field
- We can modify the `_source` values while reindexing
  - Can also be handled at the application level
POST /_reindex
{
"source": {
"index": "reviews"
},
"dest": {
"index": "reviews_new"
},
"script": """
if (ctx._source.product_id != null) {
ctx._source.product_id = ctx._source.product_id.toString();
}
"""
}
Reindex documents matching a query
POST /_reindex
{
"source": {
"index": "reviews",
"query": {
"match_all": { }
}
},
"dest": {
"index": "reviews_new"
}
}
The above code specifies a query within the "source" parameter to only reindex documents that match the query.
Reindex only positive reviews
POST /_reindex
{
"source": {
"index": "reviews",
"query": {
"range": {
"rating": {
"gte": 4.0
}
}
}
},
"dest": {
"index": "reviews_new"
}
}
Removing fields
- Field mappings cannot be deleted
- Fields can be left out when indexing documents
- Maybe we want to reclaim disk space used by a field
- Already indexed values still take up disk space
- For large data sets, this may be worthwhile
- Assuming that we no longer need the values
Example:
POST /_reindex
{
"source": {
"index": "reviews",
"__source": ["content", "created_at", "rating"]
},
"dest": {
"index": "reviews_new"
}
}
- By specifying an array of field names, only those fields are included for each document when they are indexed into the destination index.
- In other words, any fields that you leave out will not be reindexed.
Changing a field's name
POST /_reindex
{
"source": {
"index": "reviews",
},
"dest": {
"index": "reviews_new"
},
"script": {
"source": """
// Rename the "content" field to "comment"
ctx._source.comment = ctx._source.remove("content");
"""
}
}
- Example: Ignore reviews with ratings below 4.0
POST /_reindex
{
"source": {
"index": "reviews",
},
"dest": {
"index": "reviews_new"
},
"script": {
"source": """
if (ctx._source.rating < 4.0) {
  ctx.op = "noop"; // can also be set to "delete"
}
"""
}
}
For "noop" value, the document will not be indexed into the destination index.
Using ctx.op within scripts
- Usually, using the `query` parameter is possible
- For more advanced use cases, `ctx.op` can be used
  - Using the query parameter is better performance-wise and is preferred
- Specifying `"delete"` deletes the document within the destination index
  - The destination index might not be empty, unlike in our example
- The same can be done with the Delete by Query API
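For comparison, a sketch of removing the same documents with the Delete by Query API instead of a reindex script:
POST /reviews_new/_delete_by_query
{
"query": {
"range": {
"rating": {
"lt": 4.0
}
}
}
}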
Parameters for the Reindex API
- A snapshot is created before reindexing documents
- A version conflict causes the operation to be aborted by default
  - e.g. if the destination index is not empty
Batching & Throttling
- The Reindex API performs operations in batches
- Similar to the Update by Query and Delete by Query APIs
- It uses the Scroll API internally
- This is how millions of documents can be reindexed efficiently
- Throttling can be configured to limit the performance impact
- Useful for production clusters
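Throttling is configured through query parameters on the request itself; a sketch (the number is arbitrary):
POST /_reindex?requests_per_second=500
{
"source": {
"index": "reviews"
},
"dest": {
"index": "reviews_new"
}
}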
Defining field aliases
- Field names can be changed when reindexing documents
- Not worth it for lots of documents
- Alternative: use field aliases
- Doesn't require documents to be reindexed
- Aliases can be used within queries
- Aliases are defined with a field mapping
PUT /reviews/_mapping
{
"properties": {
"comment": {
"type": "alias",
"path": "content"
}
}
}
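Once defined, the alias can be used within queries as if it were a real field (the search term is made up):
GET /reviews/_search
{
"query": {
"match": {
"comment": "great value"
}
}
}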
Updating field aliases
- Field aliases can be updated
  - Only their target field, though
- Simply perform a mapping update with a new path value
- Possible because aliases don't affect indexing
- It's a query-level construct
Index Aliases
- Elasticsearch also supports Index Aliases
- Used when dealing with large data volumes
Multi-field mappings
PUT /multi_field_test
{
"mappings": {
"properties":{
"description": {
"type": "text"
},
"ingredients": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
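A sketch of why the multi-field is useful: full-text searches use the text mapping, while aggregations use the keyword sub-field (the queries are made up):
GET /multi_field_test/_search
{
"query": {
"match": {
"ingredients": "tomato"
}
}
}
GET /multi_field_test/_search
{
"size": 0,
"aggs": {
"unique_ingredients": {
"terms": {
"field": "ingredients.keyword"
}
}
}
}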
Index Templates
- Index Templates API
- A way to automatically apply settings and mappings on index creation
- Works by matching index names against an index pattern
- Only a single index template can be applied to a new index
- Used for data sets that are stored in multiple indices
- e.g time series data
- Enables us to simply index documents into indices that don't already exist
- Indices can be created manually
- API request and index templates are merged (the request takes precedence)
Structure of index Templates
PUT /_index_template/my-index-template
{
"index_patterns": ["my-index-pattern*"],
"template": {
"settings": { ... },
"mappings": { ... }
}
}
- `PUT /_index_template/my-index-template` specifies the name of the index template
- `"index_patterns": ["my-index-pattern*"]` is a pattern that determines when the index template is applied
- `"settings": { ... }` contains settings to apply to the new index
- `"mappings": { ... }` contains field mappings to add to the new index
Priorities
- Index patterns cannot overlap by default
- Only a single index template can be applied to a new index
- Specify a `priority` parameter to handle overlapping index patterns
  - Defaults to zero
- The index template with the highest priority "wins"
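A sketch of two overlapping templates (the names and patterns are made up); an index named access-logs-2024 matches both patterns, so the template with priority 2 wins:
PUT /_index_template/logs-template
{
"index_patterns": ["access-logs-*"],
"priority": 1,
"template": {
"settings": { "number_of_shards": 1 }
}
}
PUT /_index_template/logs-2024-template
{
"index_patterns": ["access-logs-2024*"],
"priority": 2,
"template": {
"settings": { "number_of_shards": 2 }
}
}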
Introduction to dynamic mapping
- JSON string --> Elasticsearch `text` field with a `keyword` mapping, a `date` field, a `float` field, or a `long` field
- JSON integer --> Elasticsearch `long`
- JSON floating point number --> Elasticsearch `float`
- JSON array --> Depends on the first non-null value
Configuring dynamic mapping
PUT /people
{
"mappings": {
"dynamic": false,
"properties": {
"first_name": {
"type": "text"
}
}
}
}
Setting dynamic to false
- New fields are ignored
  - They are not indexed, but are still part of `_source`
- No inverted index is created for the `last_name` field
  - Querying the field gives no results
- Fields cannot be indexed without a mapping
- When enabled, dynamic mapping creates one before indexing values
Setting dynamic to strict
- Elasticsearch will reject unmapped fields
- All fields must be mapped explicitly
- Similar behavior as relational databases
PUT /people
{
"mappings": {
"dynamic": false,
"properties": {
"first_name": {
"type": "text"
}
}
}
}
Numeric Detection
PUT /computers
{
"mappings": {
"numeric_detection": true
}
}
When indexing a document, if a field's string value contains only a number, dynamic mapping will map the field as long or float.
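A sketch (the field names are made up): with numeric detection enabled, indexing string values that contain only numbers produces numeric mappings instead of text:
PUT /computers/_doc/1
{
"ram_gb": "64",
"cpu_ghz": "3.7"
}
Here ram_gb would be mapped as long and cpu_ghz as float.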
Dynamic Templates
PUT /dynamic_template_test
{
"mappings": {
"dynamic_templates": [
"integer": {
"match_mapping_type": "long",
"mapping": {
"type": "integer"
}
}
]
}
}
match and unmatch parameters
- used to specify conditions for field names
- Field names must match the condition specified by the `match` parameter
- `unmatch` is used to exclude fields that were matched by the `match` parameter
- Both parameters support patterns with wildcards (*)
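A sketch combining both parameters with wildcards (the template and field names are made up): string fields beginning with text_ are mapped as text, except those ending in _keyword:
PUT /match_test
{
"mappings": {
"dynamic_templates": [
{
"strings_as_text": {
"match_mapping_type": "string",
"match": "text_*",
"unmatch": "*_keyword",
"mapping": {
"type": "text"
}
}
}
]
}
}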