Berlin-based AI company Jina AI has launched its second-generation text embedding model, jina-embeddings-v2, the world's first open-source model supporting an 8K (8,192-token) context length.
It matches the performance of OpenAI's proprietary model, text-embedding-ada-002, on the Massive Text Embedding Benchmark (MTEB) leaderboard.
The CEO of Jina AI, Dr. Han Xiao, stated that the launch of jina-embeddings-v2 is part of the company’s mission to “democratize AI and empower the community with tools that were once confined to proprietary ecosystems.”
The Jina AI team spent three months on intensive research and development, data collection, and tuning to develop the new model. It is available for download on Hugging Face in two versions:
- Base model (0.27 GB) for heavy-duty tasks requiring higher accuracy
- Small model (0.07 GB) for lightweight applications or devices with limited computing resources
What is text embedding?
Text embedding is a key technique in natural language processing (NLP) that converts text into numerical vectors. It can be used for various NLP tasks, such as machine translation, text classification, and question answering.
The longer the context length of a text embedding model, the more information it can capture about the text, leading to better performance on these tasks. Text embedding models are essential for building intelligent applications that can understand and manipulate natural language, such as chatbots, search engines, and voice assistants.
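To make the idea concrete, embedding vectors are typically compared with cosine similarity: semantically related texts map to vectors pointing in similar directions. The tiny 4-dimensional vectors below are hand-picked for illustration only, not output from any real model (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for three words.
cat    = np.array([0.9, 0.1, 0.0, 0.2])
kitten = np.array([0.8, 0.2, 0.1, 0.3])
car    = np.array([0.1, 0.9, 0.7, 0.0])

print(cosine_similarity(cat, kitten))  # high: semantically related
print(cosine_similarity(cat, car))     # low: unrelated
```

A search engine or chatbot retrieves relevant text by embedding the query and ranking documents by exactly this kind of similarity score.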
Why does 8K context length matter?
The 8K context length allows jina-embeddings-v2 to capture more information and nuances from long texts (legal documents, scientific papers, literary works, financial reports, and conversational queries).
For example, the model can better understand the scientific concepts and arguments in a paper, the legal terms and clauses in a contract, the plot and characters in a novel, the trends and insights in a report, and the intents and emotions in a query. This leads to higher-quality, more faithful text embeddings for various downstream tasks.
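One practical consequence: a model limited to, say, 512 tokens must split a long document into chunks and embed each piece separately, losing cross-chunk context, whereas an 8,192-token window often fits the whole text in a single embedding. A minimal sketch (using a flat token list as a stand-in for a real tokenizer's output):

```python
def chunk_tokens(tokens, max_len):
    """Split a token sequence into consecutive windows of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

# A hypothetical 6,000-token document, e.g. a scientific paper.
doc = ["tok"] * 6000

short_ctx = chunk_tokens(doc, 512)   # 512-token model: 12 separate embeddings
long_ctx  = chunk_tokens(doc, 8192)  # 8K model: one embedding of the whole text

print(len(short_ctx), len(long_ctx))  # 12 1
```

With the short context, downstream code must also decide how to merge a dozen partial embeddings back into one representation; the 8K window sidesteps that entirely.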
How does jina-embeddings-v2 compare with other models?
The new model rivals OpenAI's proprietary text-embedding-ada-002 on the Massive Text Embedding Benchmark (MTEB) leaderboard, a comprehensive evaluation platform that measures embedding quality across 56 datasets covering various domains and tasks.
Jina-embeddings-v2 outperformed other leading base embedding models on several datasets when given extended context, highlighting the benefits of using longer context (see the figure below).
What are the potential applications of jina-embeddings-v2?
Jina-embeddings-v2 can be used in a wide range of NLP applications, including machine translation, text classification, question answering, code generation, legal and medical document processing, and financial forecasting.
What’s next for Jina AI?
The company plans to:
- Publish an academic paper detailing the technical intricacies and benchmarks of jina-embeddings-v2
- Develop an OpenAI-like embeddings API platform
- Launch German-English multilingual embedding models
Jina-embeddings-v2 is the world’s first open-source 8K text embedding model, delivering performance that rivals the best proprietary models. This will enable developers to build more powerful and accurate NLP applications.
Release announcement: “Jina AI Launches World’s First Open-Source 8K Text Embedding, Rivaling OpenAI” (on Jina AI)