Explaining how vector search works, its benefits, and its use cases with Elasticsearch and OpenSearch.

Search engines like Elasticsearch, OpenSearch, and Solr have long supported term-based search, using an inverted index to quickly find documents containing a given word or phrase. This is often referred to as “keyword search.”

Keyword search is effective, but it has significant shortcomings: ranking can be difficult to tune, and on its own it cannot capture semantic context or handle synonyms.

Recently, significant advancements have brought vector search to Elasticsearch and OpenSearch. Now built into Lucene, the engine at the core of both, vector search capabilities bring considerably more power to the table when it comes to search.

In this article we will explore not only the differences, but also the surprising similarities, between keyword and vector search and learn how combining them can lead to superior results.

Vector search is a method used in information retrieval and machine learning where documents or data points are represented as vectors in a high-dimensional space. Each vector dimension corresponds to a distinct data characteristic or attribute.

The more characteristics shared between vectors, the closer they will be located in the vector space. This is why vectors representing text documents will be closer when the text is about a similar theme.

To perform vector search, raw data such as text, photos, audio, or video first needs to be embedded using an embedding function to create the vectors.

Then, by evaluating the distance between vectors, a vector-search engine can find similar results based on their proximity in this space, allowing for quick and scalable retrieval of similar records.
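For example, if each dimension encodes a characteristic, vectors sharing more characteristics score higher under cosine similarity. A toy sketch with hand-picked three-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means identical direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: documents about similar themes point in similar directions.
doc_about_search = [0.9, 0.1, 0.2]
doc_about_search_too = [0.8, 0.2, 0.1]
doc_about_cooking = [0.1, 0.9, 0.3]

print(cosine_similarity(doc_about_search, doc_about_search_too))  # high
print(cosine_similarity(doc_about_search, doc_about_cooking))     # much lower
```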

Vector search brings search to an entirely new level because it captures semantics much better than other search methods.

Common use cases for vector search include:

  • Semantic search – Vector search lets you find what users mean, going beyond exact keyword matches and opening the door to synonyms, analogs, and taxonomies.
  • Recommendation engines – Because the embedding model places similar items close together in vector space, a document’s nearest neighbors make natural recommendation candidates.
  • Image and audio search – Embeddings can capture meaning that isn’t expressed in text, enabling similarity search over media based on various definitions of similarity. For example, you can search for images that are similar in size, color, or content, or for songs in a genre you prefer.

But how do you transform text into vectors? As it turns out, if you’re using Elasticsearch or OpenSearch, you have been doing this for a long time already! Let’s first understand how to create vectors from text and then see how those vectors can be searched for and ranked.

Keyword Search Under the Hood

At their core, both Elasticsearch and OpenSearch use BM25 to rank documents for keyword search. BM25 is a ranking function that builds on the Bag of Words (BoW) representation by introducing term weighting and document length normalization. To better understand how keyword search works, let’s first explore how BoW transforms text into vectors.
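As a reference point, here is a minimal sketch of the BM25 formula for a single query term, using Lucene’s default parameters (k1 = 1.2, b = 0.75). This simplifies over the real implementation, but it shows the two ingredients mentioned above: term weighting (via IDF) and document length normalization:

```python
import math

def bm25_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Simplified BM25 score for one query term.
    tf: term frequency in the document
    df: number of documents containing the term
    n_docs: total documents in the corpus
    """
    # Rarer terms get a higher inverse document frequency (IDF) weight.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Term frequency saturates, and longer documents are penalized.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# A rare term scores higher than a common one, all else being equal.
rare = bm25_score(tf=2, df=3, n_docs=1000, doc_len=100, avg_doc_len=120)
common = bm25_score(tf=2, df=800, n_docs=1000, doc_len=100, avg_doc_len=120)
print(rare > common)  # True
```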

For this example, let’s consider the input sentence:

"Elasticsearch and OpenSearch are great for both keyword and vector search."

Here are the steps BoW takes to generate the output vector:

  1. Normalize – Convert to lowercase and remove punctuation/non-word characters.
    Output:
    "elasticsearch and opensearch are great for both keyword and vector search"

  2. Tokenization – Split the text into an array of words.
    Output:

    ['elasticsearch', 'and', 'opensearch', 'are', 'great', 'for', 'both', 'keyword', 'and', 'vector', 'search']
    
  3. Build Vocabulary & Compute Frequencies – Each word is counted within the document and also tracked across the entire corpus (if we had multiple documents).
    Output:

    {'and': 2, 'are': 1, 'both': 1, 'elasticsearch': 1, 'for': 1, 'great': 1, 'keyword': 1, 'opensearch': 1, 'search': 1, 'vector': 1}
    

    In our case, since we only have one document, term frequency (TF) and document frequency (DF) are identical. In a larger corpus, the document frequency would track how many documents contain each term, while the term frequency would indicate how often the term appears within each document.

    The vector representing our sentence in a larger corpus of text would look something like this:

    [0, 0, …, 2, 1, 0, 0, …, 1, 1, …]
    

    Most values would be 0, hence this type of vector is also known as a sparse vector.

  4. Searching for Documents
    When a user searches for the word "vector," Elasticsearch queries its inverted index and retrieves all documents where the word appears at least once. From there, BM25 ranks results based on relevance, factoring in how frequently the term appears and how rare it is across the corpus.
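The first three steps above can be sketched in a few lines of Python (an illustrative toy implementation, not how Lucene does it internally):

```python
import re
from collections import Counter

def bag_of_words(text):
    # Step 1: Normalize - lowercase and strip punctuation/non-word characters.
    normalized = re.sub(r"[^\w\s]", "", text.lower())
    # Step 2: Tokenization - split the text into an array of words.
    tokens = normalized.split()
    # Step 3: Compute term frequencies within the document.
    return Counter(tokens)

counts = bag_of_words(
    "Elasticsearch and OpenSearch are great for both keyword and vector search."
)
print(counts["and"])     # 2
print(counts["search"])  # 1
```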

However, as we mentioned earlier, this approach has major limitations:

  • Word order is ignored - "cat chases dog" vs. "dog chases cat" are treated the same.
  • Synonyms and related concepts are not captured - "smartphone" and "mobile phone" are treated as entirely different terms.
  • Sparse and high-dimensional - since each word gets its own axis in the vector space, the resulting vectors are very large.

Embeddings and Dense Vectors

To overcome these limitations, we can use embedding models: AI models trained to create dense vector representations that capture semantic meaning rather than just word occurrence.

If we use the previous text example and create an embedding vector using an embedding model (such as OpenAI embeddings, Cohere embeddings, etc.), it would look something like this:

[ 0.12, -0.43, 0.87, -0.33, 0.02, ..., -0.17, 0.45, 0.21 ]

This vector has far lower dimensionality than its sparse counterpart (the exact size depends on the embedding model), yet it is highly expressive (hence the term dense vector). While it may be unclear to the naked eye how each value relates to the original sentence, the vector captures the meaning of the whole sentence rather than individual words.

We can store the vector alongside the original text, to use it later, for search. Then if we want to find documents to answer the question:

"Which database is best for semantic search?"

We would convert the question into a vector using the same embedding algorithm:

[ 0.15, -0.47, 0.95, -0.32, 0.01, ..., -0.55, 0.48, 0.22 ] 

Then we would search for the documents whose vectors are closest to it (in terms of cosine similarity or Euclidean distance) and retrieve the relevant results, even if none of the exact keywords are present.
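A toy sketch of this ranking step, where the hand-made three-dimensional vectors stand in for real embeddings:

```python
import math

# Toy pre-computed embeddings; in practice these come from an embedding
# model and have hundreds of dimensions.
documents = {
    "Elasticsearch supports semantic search via dense vectors": [0.9, 0.1, 0.3],
    "How to bake sourdough bread at home": [0.1, 0.8, 0.2],
    "OpenSearch k-NN plugin for similarity search": [0.85, 0.15, 0.35],
}
# Made-up embedding of "Which database is best for semantic search?"
query_vector = [0.88, 0.12, 0.32]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Rank documents by similarity to the query vector, closest first.
ranked = sorted(documents, key=lambda d: cosine(documents[d], query_vector), reverse=True)
print(ranked)  # the two search-related documents outrank the baking one
```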

Vector Search: A Dive into k-NN vs ANN

The final piece of the puzzle is understanding how Elasticsearch and OpenSearch retrieve the closest vectors for a given query.

To do this, they rely on two main techniques: k-Nearest Neighbors (k-NN) and Approximate Nearest Neighbors (ANN), both of which compare a query vector against the stored document embeddings to identify the most relevant matches.

  • k-NN (k-Nearest Neighbors) performs an exhaustive brute-force search, calculating the distance between the query vector and every vector in the index. This method guarantees 100% accuracy but is computationally expensive, making it impractical for large-scale datasets. It is best suited for smaller indices where precision is prioritized over speed.
  • ANN (Approximate Nearest Neighbors), introduced in Elasticsearch 8.0, improves search speed by using an optimized indexing algorithm, such as HNSW (Hierarchical Navigable Small World). This approach trades off some accuracy for significantly faster retrieval, making it ideal for large-scale semantic search applications.

In short, k-NN ensures perfect accuracy at the cost of performance, while ANN prioritizes speed, making it the preferred choice for production use cases.
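To make the trade-off concrete, a brute-force k-NN pass can be sketched in a few lines of Python; ANN structures such as HNSW exist precisely to avoid this full O(n) scan (a toy sketch, not Lucene’s implementation):

```python
import heapq
import math
import random

random.seed(42)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, vectors, k=3):
    # Exhaustive search: computes the distance to *every* vector in the
    # index, which guarantees exact results but costs O(n) per query.
    return heapq.nsmallest(k, vectors, key=lambda v: euclidean(query, v))

# 1,000 random 8-dimensional vectors standing in for document embeddings.
index = [[random.random() for _ in range(8)] for _ in range(1000)]
query = [random.random() for _ in range(8)]

top3 = knn(query, index, k=3)  # the 3 closest vectors, nearest first
```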

Summary

We explored the evolution of search in Elasticsearch and OpenSearch, from traditional keyword-based retrieval using BM25 to semantic search powered by dense vector embeddings.

In other articles, we dive deeper into embedding models, show you how to generate embeddings for text (for instance using Elasticsearch ELSER), and walk through setting up hybrid search that combines keyword and vector search in both Elasticsearch and OpenSearch. Also worth noting is our Elasticsearch vs OpenSearch comparison specifically in the context of vector search.