Inverted Indices
In the analysers and analyzer API sections we saw how ES tokenizers data.
How ES stores this data is in an inverted index, which is a core search concept.
Remember that ES is built ontop of Lucene. Apache Lucene is actually the underlying tech that creates the inverted index
What is an inverted index?
An inverted index is a mapping between terms and the documents that contain them. The "terms" are really the tokens generated in ES' text analysis.
ES takes the following string: 2 guys walk into a bar, but the third... DUCKS!:-)
.
Performs text analysis on it (and thus tokenization) to output: ["2", "guys", "walk", "into", "a", "bar", "but", "the", "third", "ducks"]
.
This is then stored in an inverted index:
Inverted indexes allow fast search
Different inverted indices
When ES does text analysis it detects the data types of each token.
For example: 4 legged table!
is processed as:
{
"tokens": [
{
"token": "4",
"start_offset": 0,
"end_offset": 1,
"type": "<NUM>",
"position": 0
},
{
"token": "legged",
"start_offset": 2,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "table",
"start_offset": 9,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
}
]
2
is detected as the NUM
type, whereas table
and legged
are alphanumeric (ALPHANUM).
ES only creates inverted indices for text/alphanumeric/ALPHANUM fields.
Other data types are instead stored in a BKD tree.