The Hitchhiker’s Guide to Vector Embeddings
By Carter Rabasa, Head of Developer Relations at DataStax
Businesses thinking about embarking on their first adventure in Generative AI might be wondering where to start. GenAI has opened up a host of new and exciting use cases for developers. By now everyone has seen their fair share of chatbots, but there are many other use cases for GenAI, such as building intelligent agents, content creation experiences with text/audio/video, synthetic data, language translation and much more.
Underpinning most of those experiences is something called vector embeddings. Vector embeddings are the assembly language of AI: they are how you go from a natural language query like “Who wrote the Hitchhiker’s Guide to the Galaxy?” to the correct response of “Douglas Adams.” Vector embeddings allow developers to operate on unstructured data, whether it’s coming from the user in the form of a prompt or from the documents, PDFs and files that form the knowledge base you’re building your app on.
This article explains what vector embeddings are and how they fit into building GenAI apps.
First step: Unstructured data
There are generally two types of data: structured and unstructured. Structured data is what most developers work with every day, whether it’s a variable in your code or a field in your database. This data is often typed (numbers, strings, dates, etc.), and modern databases are very good at efficiently storing and querying this kind of data.
Unstructured data, on the other hand, represents everything else: documents, articles, web pages, videos, and audio — you name it. Google, famously, built one of the first large-scale consumer applications that operated on unstructured data (web pages) and had to more or less build completely new technology to power their search engine.
GenAI applications also rely heavily on unstructured data. When building a modern search engine that understands natural language queries, you’ll need to be able to operate on the documents or knowledge base that users are trying to get insights from. Building most GenAI apps will involve taking unstructured data, storing it in a database, often together with structured data, and then retrieving it based on a user’s request. Retrieval should be fast and return only the most relevant information.
Ensuring accurate, efficient retrieval is a major challenge that arises with unstructured data. It’s essential for GenAI applications, especially those that rely on retrieval-augmented generation (RAG). Reducing query time and maintaining loose data relationships are key factors in retrieval efficiency, but traditional data stores often lack the necessary capabilities or are just too slow to meet these demands. Ultimately, it all comes down to the indexing of the data.
Next step: Vector embeddings
The main challenge with unstructured data is that it can be virtually anything. There is no predetermined set of formats and data types, so it’s impossible to come up with a generic algorithm to index such data.
The primary way around this problem is to introduce an intermediate state that the data is converted to before being understood and indexed by a database. In this intermediate state, unstructured data is represented as vectors (or arrays) of floating-point numbers. These vectors are called ‘embeddings’.
One way to think about embeddings is that they’re a representation of your data in a multidimensional space. Pieces of content that are semantically similar to each other will reside close to each other in this space.
As a simple example, let’s say we want to map the following set of words into a two-dimensional space:
- Sheep
- Cow
- Pig
- Kangaroo
- Coffee
- Tea
The result could look something like this:
Cow, sheep, and pig are all domestic animals, so there is very little distance between them. Kangaroo is a wild animal, so it’s located slightly away from the group of domestic animals, but it’s still an animal, so it’s not too far away. Conversely, coffee and tea end up in a completely different area of this two-dimensional space.
You can now use this representation to find words that are similar to your inputs. For example, to answer the question “What is similar to milk?”, you would generate a vector embedding for the word “milk,” compare it to the vectors that already exist in the space, and find the closest ones. Milk is a drink, so the result is going to be near “coffee” and “tea.”
In this case, the words have been manually mapped onto the graph based on an intuitive understanding of their meaning. In a real app, words will be replaced with larger pieces of content, the space will have hundreds or even thousands of dimensions rather than two, and you’ll need a way to automate the generation of these vector embeddings.
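To make this concrete, here’s a minimal Python sketch of the toy example above. The two-dimensional coordinates are hand-picked for illustration (a real embedding model would produce them for you), and the function names are just for this example:

```python
import math

# Hand-picked 2D "embeddings" for the toy example above (illustrative only;
# a real model would generate these coordinates).
words = {
    "sheep":    (1.0, 1.0),
    "cow":      (1.2, 0.9),
    "pig":      (0.9, 1.2),
    "kangaroo": (2.0, 1.5),
    "coffee":   (8.0, 7.5),
    "tea":      (8.2, 7.8),
}

def most_similar(query_vector, k=2):
    """Return the k words whose vectors are closest to the query vector."""
    return sorted(words, key=lambda w: math.dist(words[w], query_vector))[:k]

# Pretend an embedding model has mapped "milk" near the drinks cluster.
milk = (7.5, 7.2)
print(most_similar(milk))  # -> ['coffee', 'tea']
```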
Translating to vector embeddings
When building GenAI apps, businesses can use machine learning models to convert unstructured data into vector embeddings. There are dozens (hundreds? thousands?) of machine learning models to choose from, all of which generate vectors that form clusters based on how they were trained. Some models are optimised for text, some for images, and models like CLIP can handle both!
One of the most popular providers of text embedding models is OpenAI.
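As a rough sketch, generating an embedding with OpenAI’s Python client looks something like the following (the model name and input text are placeholders, and an OPENAI_API_KEY is assumed to be set in your environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",  # pick whichever embedding model fits your use case
    input="Who wrote the Hitchhiker's Guide to the Galaxy?",
)

embedding = response.data[0].embedding  # a plain list of floats
print(len(embedding))  # this model returns 1536-dimensional vectors by default
```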
A fork in the road: Choosing an embedding model
Every application is different and selecting the right embedding model is crucial for building a successful app that users love.
First and foremost, consider the relevance of the model to the specific use case. Not all models are created equal; some are fine-tuned for particular types of text or applications. When working with casual conversation data, a general-purpose model can suffice. However, if you’re building an application for a domain where jargon and specialised terminology are abundant, consider using a domain-specific model. Models trained specifically on medical, scientific, legal, or other specialised data are better equipped to understand and generate embeddings that accurately reflect the language used in that field.
Also consider the overall quality and robustness of the model. Look into the training data used, the size of the model, and its performance metrics. Many providers publish benchmarks and comparisons that can help with this decision. Ultimately, the goal is to find a balance between relevance, language support, domain specificity, latency, and cost. By carefully evaluating these aspects, you can select an embedding model that not only meets the technical requirements of your application but also enhances its overall performance and user experience.
But once you have selected an embedding model and you’re happily generating vector embeddings at scale, what do you do with them? And how does this fit into building GenAI apps?
Don’t panic: Vector databases
Once the embeddings have been generated for the unstructured data, the next step is to store and manage these embeddings in a vector database. Vector databases are designed to handle the storage and retrieval of high-dimensional vectors, and they use advanced indexing techniques to ensure efficient operations.
When you add an embedding to a vector database, the database indexes the vector to optimise for fast retrieval. This involves organising the vectors into a structure that allows efficient nearest-neighbour searches, clustering similar vectors together so they can be accessed quickly when searching.
When you query the database with a new embedding (generated from a user query, for instance), it performs a nearest-neighbour search to find the most similar vectors. The results are then mapped back to the original content, providing relevant information to the user.
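Conceptually, the query operation boils down to a nearest-neighbour search over the stored vectors. The brute-force sketch below (using made-up three-dimensional vectors) shows the idea; a real vector database replaces the linear scan with an approximate index such as HNSW so it stays fast at scale:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_neighbours(query, stored, k=3):
    """Return the k stored items most similar to the query embedding.

    A brute-force scan for illustration; vector databases use approximate
    indexes (e.g. HNSW) to avoid comparing against every stored vector.
    """
    ranked = sorted(stored.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy usage with 3-dimensional vectors (real embeddings have hundreds or
# thousands of dimensions).
stored = {
    "coffee": np.array([0.9, 0.1, 0.0]),
    "tea":    np.array([0.8, 0.2, 0.1]),
    "sheep":  np.array([0.0, 0.9, 0.4]),
}
query = np.array([0.85, 0.15, 0.05])  # e.g. the embedding for "milk"
print(nearest_neighbours(query, stored, k=2))  # -> ['coffee', 'tea']
```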
By leveraging vector databases, you can ensure that your GenAI applications are capable of handling complex queries and retrieving relevant information quickly and accurately.
Vectorize
“Wouldn’t it be nice if a vector database already knew what kind of embedding model was being used for an app?”
Astra Vectorize has entered the chat.
What an application developer will mainly need from a database is the ability to insert and update data, both structured and unstructured, and then retrieve it based on some inputs. Vectors and embeddings are an implementation detail of how the data is indexed and stored; manually building and maintaining these intermediate data structures isn’t where your focus should be.
Think of traditional databases. If you have a bunch of floating-point numbers representing the prices of products, you simply create a Float or a Double field in one of your collections or tables. Strings representing descriptions of those products? That’s a Text field. You may need to provide some indexing configuration along with the data, but it’s generally optional and comes with sensible defaults.
The goal of Astra Vectorize is to provide a similar workflow for unstructured data, enabling you to perform CRUD operations directly with this data rather than with vectors generated elsewhere. In addition, Vectorize comes with out-of-the-box access to the NV-Embed-QA model by NVIDIA, which works well for many use cases. This model is hosted by Astra DB itself, so you don’t have to create any additional accounts or configure any external services.
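As a rough sketch, using Vectorize from the astrapy Python client might look something like this, assuming a collection that has already been created with Vectorize enabled; the token, endpoint, and collection name are placeholders, and exact method names and options can vary by client version:

```python
from astrapy import DataAPIClient

# Placeholders: use your own Astra DB token and API endpoint.
client = DataAPIClient("YOUR_ASTRA_DB_APPLICATION_TOKEN")
db = client.get_database("https://<db-id>-<region>.apps.astra.datastax.com")

# Assumes the collection was created with Vectorize enabled (e.g. NV-Embed-QA).
collection = db.get_collection("quotes")

# Insert raw text; the database generates the embedding for you via $vectorize.
collection.insert_one({
    "title": "The Hitchhiker's Guide to the Galaxy",
    "$vectorize": "Douglas Adams wrote The Hitchhiker's Guide to the Galaxy.",
})

# Query with natural language; the database embeds the query and runs the
# nearest-neighbour search behind the scenes.
results = collection.find(
    sort={"$vectorize": "Who wrote the Hitchhiker's Guide to the Galaxy?"},
    limit=3,
)
for doc in results:
    print(doc["title"])
```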