In the wake of ChatGPT’s success in late 2022, OpenAI released its text-embedding-ada-002 model, which transformed natural language information retrieval.
Building on that work, OpenAI has now introduced two new embedding models: text-embedding-3-small and text-embedding-3-large.
These are not mere incremental updates but significant advancements that enhance OpenAI’s capability to process and understand language, opening a new chapter in natural language processing.
In this article, we’ll explore the capabilities of these new embedding models and their potential to reshape the AI landscape.
1 What Are Embeddings?
In simple terms, an embedding is a sequence of numbers, but it’s so much more than just digits. These sequences represent concepts within content, such as the natural language we use daily or even code.
Imagine having a complex idea, and then being able to translate that into a language that a machine learning model or algorithm can understand. That’s exactly what embeddings do!
2 Why Are Embeddings Crucial?
Embeddings are not just a neat trick; they play a pivotal role in machine learning. They enable these smart models and algorithms to comprehend relationships between different content types.
This understanding is critical for tasks like clustering, where we group similar items, or retrieval, where we fetch relevant data based on a query.
Think of embeddings as a bridge. On one side, we have human language or code, intricate and nuanced. On the other side, we have the analytical, data-driven world of machine learning.
Embeddings form a bridge between these two, making it possible for one to interact with the other seamlessly.
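To make this concrete, here is a minimal sketch using numpy with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions) of how similarity between two embedded concepts can be measured:
import numpy as np

# Hypothetical, tiny embeddings; real models produce 1536 or more dimensions.
cat = np.array([0.8, 0.1, 0.3])
kitten = np.array([0.7, 0.2, 0.35])
car = np.array([0.1, 0.9, 0.2])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 for related concepts, lower for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(cat, kitten))  # high: related concepts
print(cosine_similarity(cat, car))     # lower: unrelated concepts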
3 Embeddings in Action: ChatGPT, Assistants API, and RAG Tools
Now, let’s see where all this comes into play in our everyday tools and applications. Embeddings are the powerhouse behind applications like knowledge retrieval in ChatGPT and the Assistants API.
These are the tools we often use to fetch information or automate responses. Additionally, they’re a big deal in Retrieval Augmented Generation (RAG) developer tools.
For instance, when you interact with ChatGPT, asking it to generate a piece of code or explain a concept, embeddings are working in the background. They help the model understand your query and pull relevant information to craft a response that makes sense.
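As a rough sketch of that retrieval step (the document set, the embed helper, and the prompt format here are invented for illustration; a production RAG system would typically use a vector database), the flow looks something like this:
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

documents = [
    "Embeddings map text to vectors of numbers.",
    "Paris is the capital of France.",
]

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(input=text, model="text-embedding-3-small")
    return np.array(response.data[0].embedding)

# Embed the documents once, then embed each incoming query.
doc_vectors = [embed(doc) for doc in documents]
query = "What are embeddings?"
query_vector = embed(query)

# OpenAI embeddings are normalized, so a dot product ranks by similarity.
scores = [float(np.dot(query_vector, vec)) for vec in doc_vectors]
best_doc = documents[int(np.argmax(scores))]

# The retrieved document is then placed into the model's context window.
prompt = f"Answer using this context:\n{best_doc}\n\nQuestion: {query}"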
In a nutshell, embeddings are the unsung heroes in the background, enabling our favorite AI tools to understand and interact with our world.
As developers, having a grasp on this concept opens up a plethora of possibilities in creating more intuitive and intelligent applications.
4 Introducing text-embedding-3-small and text-embedding-3-large
OpenAI has recently introduced two groundbreaking models: the text-embedding-3-small and the text-embedding-3-large.
These models are designed to transform the way we approach machine learning tasks, offering both efficiency and power.
text-embedding-3-small: The Efficient Model
The text-embedding-3-small is a game changer for efficiency-focused applications. It’s not just an incremental update; it’s a significant leap over its predecessor, text-embedding-ada-002, released back in December 2022.
Enhanced Performance
- Multi-language Retrieval: On the MIRACL benchmark, we’ve seen an impressive jump from 31.4% to a robust 44.0%.
- English Tasks: In the MTEB benchmark, performance has risen from 61.0% to 62.3%.
Reduced Pricing
What’s even more exciting is the cost-effectiveness. The pricing for text-embedding-3-small has been slashed by 5X compared to text-embedding-ada-002. This means you now pay just $0.00002 per 1k tokens, down from $0.0001.
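At these rates, embedding even large corpora stays affordable; a quick back-of-the-envelope calculation:
# Cost of embedding 10 million tokens at the new and old prices.
tokens = 10_000_000
cost_new = tokens / 1_000 * 0.00002  # text-embedding-3-small: $0.20
cost_old = tokens / 1_000 * 0.0001   # text-embedding-ada-002: $1.00
print(f"${cost_new:.2f} vs ${cost_old:.2f}")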
Continued Support for text-embedding-ada-002
For those who are still using the ada-002 model, there’s no need to worry. OpenAI isn’t deprecating it, so you can continue using it, although the newer model is recommended.
text-embedding-3-large: The Powerful Model
Now, let’s shift gears to the more robust sibling, the text-embedding-3-large. This model is the next generation of embedding technology, boasting up to 3072 dimensions.
Superior Performance
- On MIRACL: It’s not just an improvement; it’s a leap, with scores soaring from 31.4% to an impressive 54.9%.
- On MTEB: Similarly, it outshines its predecessors with scores climbing from 61.0% to 64.6%.
Pricing
Despite its advanced capabilities, the pricing remains accessible at $0.00013 per 1k tokens.
5 Advanced Features and Flexibility
One of the most innovative aspects of these new models is their flexibility. Developers, you can now tailor the performance and cost of using embeddings to your specific needs.
Shortening Embeddings Without Losing Quality
- The Technique: You can shorten the length of embeddings from these models without losing their concept-representing properties. This is achieved by using the dimensions API parameter.
- Example: On the MTEB benchmark, a shortened text-embedding-3-large embedding (down to 256 dimensions) still outperforms an unshortened text-embedding-ada-002 embedding at 1536 dimensions.
Trade-off Between Performance and Cost
This ability to shorten embeddings means you can make strategic decisions about performance versus cost. For instance, if you’re using a vector data store that supports up to 1024 dimensions,
you can still use the powerful text-embedding-3-large model by specifying 1024 dimensions, effectively balancing accuracy with vector size requirements.
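The dimensions parameter (shown in Step 7 below) is the supported way to request a shorter vector at generation time. If you already have full-length vectors, here is a minimal sketch of client-side shortening, assuming you re-normalize after truncating so dot products remain meaningful (the helper name is ours, not part of the OpenAI library):
import numpy as np

def shorten_embedding(embedding: list[float], dim: int) -> np.ndarray:
    """Truncate an embedding to `dim` dimensions and re-normalize to unit length."""
    vector = np.array(embedding[:dim])
    return vector / np.linalg.norm(vector)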
6 How To Use The New Embedding Models
Step-by-Step Guide for Developers
Step 1: Understanding the Models
Before diving into the practical use, it’s essential to understand what these new models, text-embedding-3-small and text-embedding-3-large, offer:
- text-embedding-3-small: Optimized for efficiency, suitable for applications where cost and speed are critical.
- text-embedding-3-large: Designed for superior performance, ideal for complex tasks requiring detailed embeddings.
Step 2: Choosing the Right Model
- Assess your project requirements.
- For general purposes and cost-effectiveness, go for text-embedding-3-small.
- For tasks demanding higher accuracy and detail, choose text-embedding-3-large.
Step 3: Preparing Your Environment
- Ensure you have Python installed on your system.
- Install the OpenAI Python package:
pip install openai
Step 4: Generating Embeddings
- Import the OpenAI library in your Python script.
- Set up your OpenAI client with your API key.
from openai import OpenAI

# Pass your key directly, or omit api_key and set the OPENAI_API_KEY environment variable.
client = OpenAI(api_key="your-api-key")
Step 5: Sending Text for Embedding
- Choose your text data that needs embedding.
- Call the embedding function with your text and the chosen model.
response = client.embeddings.create(
input="Your text string goes here",
model="text-embedding-3-small" # or "text-embedding-3-large"
)
Step 6: Handling the Response
- The response will contain an embedding vector.
- This vector is your original text, transformed into a numerical format that machine learning models can interpret.
- Extract and save this embedding for your application’s use.
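For example, with the response object from Step 5, the vector can be read from response.data[0].embedding:
embedding = response.data[0].embedding  # a plain list of floats
print(len(embedding))  # 1536 for text-embedding-3-small
# Persist it however your application needs, e.g. in a vector database.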
Step 7: Adjusting Embedding Size (Optional)
- If needed, you can adjust the size of the embeddings for the text-embedding-3-large model.
- Use the dimensions parameter to reduce the vector size, trading some accuracy for reduced storage.
response = client.embeddings.create(
input="Your text string",
model="text-embedding-3-large",
dimensions=1024 # Reducing from 3072 to 1024
)
Step 8: Applying Embeddings in Your Application
- Use these embeddings for various tasks like search, clustering, or as features in machine learning models.
- They can be used in recommendation systems, anomaly detection, or as a base for more complex AI applications.
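As one illustration, here is a minimal clustering sketch; it assumes scikit-learn is installed and reuses the hypothetical embed() helper from the retrieval example earlier (any function returning embedding vectors works):
import numpy as np
from sklearn.cluster import KMeans

texts = [
    "How do I reset my password?",
    "I forgot my login credentials.",
    "What is your refund policy?",
    "Can I get my money back?",
]
vectors = np.array([embed(text) for text in texts])

# Group the texts into two clusters based on embedding similarity.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for text, label in zip(texts, labels):
    print(label, text)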
Tips for Efficient Usage
- Experiment with both models to see which fits your use case better.
- Remember that text-embedding-3-small is more cost-effective, while text-embedding-3-large offers more detailed embeddings.
How To Determine the Number of Tokens in Your Text Before Embedding?
To accurately count the number of tokens in your text, use OpenAI’s tokenizer, tiktoken. In Python, simply import tiktoken and use it to encode your string. This method will return the exact number of tokens present in your text. For instance:
import tiktoken
def num_tokens_from_string(string: str, encoding_name: str) -> int:
"""Returns the number of tokens in a text string."""
encoding = tiktoken.get_encoding(encoding_name)
num_tokens = len(encoding.encode(string))
return num_tokens
token_count = num_tokens_from_string("Your text here", "cl100k_base")  # cl100k_base is the encoding used by these embedding models
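In recent versions, tiktoken can also look up the right encoding for a model directly, which avoids hardcoding the encoding name:
encoding = tiktoken.encoding_for_model("text-embedding-3-small")
print(len(encoding.encode("Your text here")))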
By following these steps, you can effectively utilize OpenAI’s new embedding models in your projects.
Whether you’re enhancing a search engine, clustering data, or building complex AI applications, these embeddings provide a powerful tool to transform text into actionable insights.
7 Conclusion
In summary, OpenAI’s introduction of the text-embedding-3-small and text-embedding-3-large models marks a significant milestone in the field of natural language processing.
These models not only enhance the efficiency and depth of machine learning applications but also open new avenues for developers to innovate and excel.
Whether you’re optimizing search functionalities, clustering data, or building sophisticated AI systems, these embeddings serve as a robust foundation to translate complex human language into actionable, machine-readable insights.
8 Frequently Asked Questions on Using OpenAI’s Text Embeddings:
What Is the Best Way to Quickly Retrieve K Nearest Embedding Vectors?
For rapid retrieval of K nearest embedding vectors, utilizing a vector database is recommended. This approach is particularly efficient when working with a large number of vectors. OpenAI’s Cookbook on GitHub offers practical examples of integrating vector databases with the OpenAI API for smooth operations.
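For smaller collections that fit in memory, a simple numpy sketch achieves the same idea (assuming matrix holds one normalized embedding per row):
import numpy as np

def top_k(query_vector: np.ndarray, matrix: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k most similar rows (dot product works on normalized vectors)."""
    scores = matrix @ query_vector
    top = np.argpartition(scores, -k)[-k:]     # k best matches, unordered
    return top[np.argsort(scores[top])[::-1]]  # sorted best-first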
Which Distance Function Should I Use with Embeddings?
Cosine similarity is generally the preferred distance function when working with OpenAI embeddings. Since these embeddings are normalized to a length of 1, the cosine similarity can be computed quickly using a dot product. Moreover, both cosine similarity and Euclidean distance will yield identical rankings, making cosine similarity a reliable and efficient choice.
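A quick sketch of why the dot product suffices for unit-length vectors:
import numpy as np

a = np.array([0.6, 0.8])  # unit length: 0.36 + 0.64 = 1
b = np.array([0.8, 0.6])
# For unit vectors the denominator of cosine similarity is 1, so it equals the dot product.
assert np.isclose(np.dot(a, b),
                  np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))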
Can I Share My Embeddings Online?
Yes, you are free to share your embeddings online. OpenAI allows customers to own their input and output, including embeddings. However, it’s crucial to ensure that the content you input into the API and the embeddings you share do not violate any applicable laws or OpenAI’s Terms of Use.