The Effect of Language in Semantic Search

Jun 30, 2023

Since we launched our PrivateGPT initiative, we have garnered significant interest from companies and individuals in over 35 countries. Most of them want to adopt AI solutions with a focus on privacy and security. However, a third key factor has emerged: the need for multilingual capabilities.

Both local and global companies recognize the importance of ensuring that their solutions are equally effective and reliable regardless of the language being used, just as they do with their products and services.

In this guide, we share a range of concepts and techniques that have arisen during our development and research on the role of language in semantic search.

What is semantic search?

Semantic search is a powerful method for data search and information retrieval. It scours documents using natural language queries in a way that grasps not just the literal meaning but also the underlying semantic context. This technique forms a key component of retrieval pipelines, as it goes beyond mere exact text matches and explores the realm of overlapping semantic meanings.

One of the real perks of semantic search is its ability to extract information from data, irrespective of the data's structure. Recent advancements in Large Language Models (LLMs) have truly revolutionized the landscape of semantic search.

Elements of semantic search

Semantic search breaks down into a few critical steps that ensure both accuracy and relevance.
These steps include:

  • Content Vectorization (Embeddings): This process transforms data into numerical vectors while preserving their semantic meaning. Concepts that share similar meanings will have closely aligned numerical values, which ultimately leads to more precise search results.

  • Storing Embeddings in Vector Databases: These numerical embeddings need to be stored in specialized databases designed to index and match vectors. Thanks to vector similarity, we can locate items based on their semantic meaning. This is achieved using metrics such as cosine distance, which gauge the likeness between two non-zero vectors.

  • Vector Search: The user's query is also converted into an embedding. Using vector similarity, we can locate the best-matching results within the database, ensuring that the most semantically relevant results are prioritized at the top of the search results list (a minimal end-to-end sketch of these steps follows this list).
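
To make these steps concrete, here is a minimal end-to-end sketch in Python. It assumes the sentence-transformers and faiss-cpu packages are installed; the model name and example documents are illustrative choices, and a FAISS index stands in for a dedicated vector database.

    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # 1. Content vectorization: turn documents into embeddings.
    #    The model choice is illustrative; any sentence-embedding model works here.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    documents = [
        "Our refund policy allows returns within 30 days.",
        "The warehouse ships orders every weekday morning.",
        "Support is available by chat and email around the clock.",
    ]
    doc_embeddings = model.encode(documents, normalize_embeddings=True)

    # 2. Storing embeddings: a FAISS index plays the role of the vector database.
    #    With normalized vectors, inner product is equivalent to cosine similarity.
    index = faiss.IndexFlatIP(int(doc_embeddings.shape[1]))
    index.add(np.asarray(doc_embeddings, dtype="float32"))

    # 3. Vector search: embed the query and retrieve the closest documents.
    query = "How long do I have to send a product back?"
    query_embedding = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(query_embedding, dtype="float32"), 2)

    for score, doc_id in zip(scores[0], ids[0]):
        print(f"{score:.3f}  {documents[doc_id]}")

Swapping the in-memory index for a hosted vector database changes only the storage step; vectorization and search stay the same.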


LLMs and embedding models

Large Language Models (LLMs) are crucial for information retrieval. An LLM is a deep learning model that can identify, summarize, translate, predict, and generate text and other content. These models are based on the transformer architecture at massive scale, learning from vast amounts of data. The transformer architecture is composed of two main parts: an encoder and a decoder.

LLMs respond to queries using information gleaned from semantic searches. The model employs a transformer architecture that grasps long-range dependencies in language, enabling it to generate coherent and contextually appropriate responses.
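
As a rough illustration of how retrieved passages feed into the model, the snippet below assembles a retrieval-augmented prompt. The passages, question, and template are illustrative assumptions; how the resulting prompt is sent to an LLM depends on the model or API you use.

    # Passages returned by the semantic search step (illustrative examples).
    retrieved_passages = [
        "Returns are accepted within 30 days of delivery.",
        "Refunds are issued to the original payment method.",
    ]
    question = "How do refunds work?"

    # Stitch the retrieved context and the question into a single prompt.
    context = "\n".join(f"- {p}" for p in retrieved_passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    print(prompt)  # pass this prompt to the LLM of your choice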

However, pre-trained Language Models, while capable of creating versatile text representations, aren't ideal for tasks like information retrieval and text matching.

As transformers are currently the best machine learning models for understanding semantics, using them to generate embeddings yields higher-quality representations. For this use case, we generally rely on the transformer's encoder, which is trained specifically for creating embeddings: it translates the input into an intermediate dense vector representation, the embedding.
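
As a sketch of what the encoder does, the snippet below runs a Hugging Face encoder model and mean-pools its token-level outputs into a single dense vector. The model name is an illustrative choice, and mean pooling is just one common way of turning encoder outputs into a sentence embedding.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Illustrative encoder; any encoder-style model can be used the same way.
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = AutoModel.from_pretrained(model_name)

    text = "Semantic search retrieves information by meaning, not by exact words."
    inputs = tokenizer(text, return_tensors="pt", truncation=True)

    with torch.no_grad():
        outputs = encoder(**inputs)  # token-level hidden states from the encoder

    # Mean-pool the token vectors (ignoring padding) into one dense embedding.
    mask = inputs["attention_mask"].unsqueeze(-1)
    embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    print(embedding.shape)  # one vector per input text, e.g. torch.Size([1, 384])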


Using embeddings for semantic search

A key feature of embeddings is assigning similar numerical arrays to similar chunks of text. For instance, the phrase "Hello, how are you?" and "Hi, what's up?" would be given similar number sequences. On the other hand, a statement like "Tomorrow is Friday" would be assigned a significantly different numerical array compared to the previous two. This distinction is vital for effective semantic search.
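
A quick sketch of this behaviour, assuming the sentence-transformers package and an illustrative English model:

    from sentence_transformers import SentenceTransformer, util

    # Illustrative model; any sentence-embedding model shows the same pattern.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    sentences = ["Hello, how are you?", "Hi, what's up?", "Tomorrow is Friday"]
    embeddings = model.encode(sentences, convert_to_tensor=True)

    # Cosine similarity between every pair of sentences.
    print(util.cos_sim(embeddings, embeddings))
    # Expect the two greetings to score far higher with each other
    # than either does with "Tomorrow is Friday".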

To carry out the semantic search, we compute the similarity between the query and each text, and then return the text with the highest similarity. By doing this, we can retrieve the information that aligns most closely with the query. Harnessing the power of LLMs, we can then easily respond to the query. Consequently, the quality of the embeddings is critically important, as the performance of semantic search depends heavily on it. Moreover, the effectiveness of the embedding model is shaped by how it was trained.
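
A minimal sketch of that retrieval step, again with sentence-transformers (model, texts, and query are illustrative): the query is embedded, compared against every stored text, and the highest-scoring text is returned.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative choice

    texts = ["Hello, how are you?", "Hi, what's up?", "Tomorrow is Friday"]
    text_embeddings = model.encode(texts, convert_to_tensor=True)

    query = "Greetings, how is it going?"
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Rank every text by cosine similarity to the query and keep the best match.
    hits = util.semantic_search(query_embedding, text_embeddings, top_k=1)[0]
    best = hits[0]
    print(texts[best["corpus_id"]], round(best["score"], 3))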

English-language embedding models will excel at retrieving English content. However, when confronted with diverse sources in multiple languages, these models may struggle to fully capture the semantic meaning of the content and could retrieve content that isn't the closest match.


How language affects embeddings

The effectiveness of semantic search depends on the robustness of the embedding. Therefore, relying on an embedding model tailored to a particular language may limit us. If we have multiple sources written in multiple languages, the model will not be able to produce embeddings that preserve the semantic meaning in all those languages.

To overcome this limitation, we have to rely on a multilingual embedding model. In the following example, we will use three sentences in English, Spanish, and French to compare an embedding model designed for a particular language (English) with a multilingual embedding model.
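
A hedged sketch of that comparison, assuming sentence-transformers is installed; both model names are illustrative stand-ins for an English-only and a multilingual embedding model.

    from sentence_transformers import SentenceTransformer, util

    english_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    multilingual_model = SentenceTransformer("intfloat/multilingual-e5-base")

    # The same sentence expressed in English, Spanish, and French.
    sentences = [
        "The weather is nice today.",
        "Hace buen tiempo hoy.",
        "Il fait beau aujourd'hui.",
    ]

    # The e5 family expects a "query: " prefix for similarity-style inputs (per its model card).
    for name, model, prefix in [
        ("english-only", english_model, ""),
        ("multilingual", multilingual_model, "query: "),
    ]:
        embeddings = model.encode([prefix + s for s in sentences], normalize_embeddings=True)
        print(name)
        print(util.cos_sim(embeddings, embeddings))
    # Expect noticeably higher cross-lingual similarities from the multilingual model.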

[Figures: 2D projections of the sentence embeddings, the first produced by an English-only embedding model and the second by a multilingual model]

In reference to the images above: the first shows what happens when we use a model trained for a specific language to handle text sources in different languages. As can be seen, the model is unable to form clusters effectively; it cannot correctly capture the semantics, and the embeddings are not arranged coherently in the vector space. Consequently, we will not be able to obtain the correct information during the semantic search.

In contrast, when we apply a multilingual model, it captures the semantic meaning of sentences across several languages and groups equivalent sentences into tightly bound clusters. This is our strategy for performing semantic search in different languages.


Multilingual embeddings for semantic search

Multilingual embedding models place the embeddings for every language within the same vector space. This allows words with similar meanings, irrespective of the language, to remain close together in that space. For instance, the Spanish word "fútbol" and the English word "soccer" are positioned closely in the embedding space, as they carry the same meaning in different languages.

By using multilingual embedding models as the base representation for texts, we can ensure that words in a new language are situated close, within the embedding space, to words in the languages the model was trained on.

The multilingual model we've utilized is multilingual-e5-base, which supports approximately one hundred languages.
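
As a small, hedged illustration of this shared space, the sketch below loads the model from the Hugging Face Hub (as intfloat/multilingual-e5-base) and compares "fútbol", "soccer", and an unrelated word.

    from sentence_transformers import SentenceTransformer, util

    # multilingual-e5-base as published on the Hugging Face Hub; the "query: "
    # prefix follows the model card's guidance for similarity-style inputs.
    model = SentenceTransformer("intfloat/multilingual-e5-base")

    words = ["fútbol", "soccer", "refrigerator"]
    embeddings = model.encode([f"query: {w}" for w in words], convert_to_tensor=True)

    print(util.cos_sim(embeddings, embeddings))
    # Expect "fútbol" and "soccer" to sit much closer to each other
    # than either does to "refrigerator".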


Conclusion

Semantic search is like a superpower when it comes to finding answers to specific queries.
But in a globalized world overflowing with a multitude of languages, it is crucial that semantic search engines maintain their accuracy and capability in multiple languages.
This is where multilingual embedding models come into play: they are critical to ensure accessibility for all.

Global companies store product information in dozens of languages and serve customers all around the world. With the help of multilingual embeddings, both employees and customers can look up anything in their native language. This not only increases convenience but also improves accuracy, allowing data to be retrieved regardless of the language in which it is stored.

If you have any ideas or questions, or would like to know more about integrating multilingual capabilities into your AI solutions, please do not hesitate to contact us. We will be happy to share our experience.

If you want to know more or collaborate with us, contact us!