What if all the embedding spaces used in AI — text, images, speech, protein sequences — share a universal structure? What if we could translate embeddings across different models without requiring linked datasets? Sounds impossible, but this paper proves otherwise!
A research team at Cornell University introduces a new unsupervised method for translating text embeddings between vector spaces without relying on paired data, encoders, or predefined correspondences. Their approach is based on the idea that embedding spaces produced by different models exhibit a shared geometric structure.

The image above shows how two input embeddings from distinct model families (T5-based GTR and BERT-based GTE), which are initially incompatible due to their different architectures, are transformed into a shared latent space where they become closely aligned.
Contents
- Embeddings: semantics in vector spaces
- The method
- Training
- Evaluation
- Example of using vec2vec translations to extract information from an email
- Security risks related to text input reconstruction
- Conclusion
- References
Embeddings: semantics in vector spaces
Embeddings are a fundamental concept in natural language processing, used to represent discrete data — such as words or image pixels — as continuous vectors in a high-dimensional space. These vectors enable a range of tasks, such as information retrieval, classification, and clustering, by capturing meaningful relationships between data points.
More than just numerical representations, embeddings encode semantic structure. Words with similar meanings tend to be located near each other, and objects sharing common attributes naturally form clusters within the vector space.
Although embedding models are trained on different datasets and with different configurations, embeddings of the same text should ideally preserve its semantics across models. In practice, however, different models generate embeddings in entirely incompatible vector spaces, making cross-model interoperability a challenge.
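To make this concrete, here is a minimal toy sketch (with made-up 4-dimensional vectors, not real model outputs) of how cosine similarity captures relatedness within one embedding space, and why raw embeddings from a second model cannot be compared directly:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means same direction, close to 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" from a single (hypothetical) model.
cat      = np.array([0.9, 0.1, 0.0, 0.2])
kitten   = np.array([0.8, 0.2, 0.1, 0.3])
airplane = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(cat, kitten))    # high: related meanings
print(cosine_similarity(cat, airplane))  # low: unrelated meanings

# An embedding of the *same* word from a different model lives in another,
# unaligned coordinate system, so comparing it directly is meaningless
# even when the dimensions happen to match.
cat_other_model = np.array([0.1, 0.0, 0.9, 0.2])
print(cosine_similarity(cat, cat_other_model))  # low despite identical meaning
```

Bridging these unaligned coordinate systems is exactly the problem the method below sets out to solve.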
The method
To address this challenge, the researchers developed vec2vec, an unsupervised method that maps embeddings from one model’s space to another using only the internal structure of the embedding spaces.
The next diagram illustrates how vec2vec translates embeddings from private documents (produced by Encoder A) into the embedding space of public documents (from Encoder B) using a shared latent space — enabling effective translation (bottom right) and even document recovery (bottom left).

The method is based on the Platonic Representation Hypothesis, which suggests that embedding spaces — whether learned from language, images, or biological data — are not random or entirely model-specific. Instead, they share underlying geometric structures that reflect a universal semantic organization. These patterns can be seen as a universal grammar — not of syntax, but of meaning in high-dimensional space.
vec2vec uses unsupervised embedding translation to convert a document embedding from an unknown space to a known space without ever seeing direct pairs — just by aligning the overall shape or distribution of the two spaces (see the picture below).

The input document di (left), which is never seen by vec2vec, is first encoded by encoder M1 into its native embedding space, producing an embedding vector ui = M1(di). vec2vec's goal is to translate this embedding into the space of a second encoder, M2, approximating the embedding M2(di) without ever seeing di or its true embedding M2(di).
To achieve this, vec2vec generates a new embedding F(ui) that aims to closely match the embedding M2(di) that M2 would have produced for the same document. This process uses a learned mapping F to translate between embedding spaces in a zero-shot, unsupervised manner, without paired data or direct access to the original inputs.
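As a rough illustration (not the paper's actual architecture), the learned mapping F can be pictured as an input adapter into a shared latent space, a small shared backbone, and an output adapter into the target space. The layer sizes and dimensions in this PyTorch sketch are assumptions chosen for clarity:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small MLP mapping between an embedding space and the shared latent space."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class Vec2VecSketch(nn.Module):
    """Translate embeddings from a source space (dim d1) into a target space (dim d2)
    through a shared latent space of dimension d_latent."""
    def __init__(self, d1: int, d2: int, d_latent: int = 512):
        super().__init__()
        self.to_latent = Adapter(d1, d_latent)       # source space -> shared latent
        self.backbone  = Adapter(d_latent, d_latent) # processing inside the latent space
        self.to_target = Adapter(d_latent, d2)       # shared latent -> target space

    def forward(self, u1: torch.Tensor) -> torch.Tensor:
        # F(u1): an approximation of the embedding the target encoder would have produced
        return self.to_target(self.backbone(self.to_latent(u1)))

# Usage: translate a batch of 768-dim source embeddings into a 768-dim target space.
translator = Vec2VecSketch(d1=768, d2=768)
u1 = torch.randn(4, 768)        # stand-in for M1(di) embeddings
translated = translator(u1)     # stand-in for F(u1), approximating M2(di)
print(translated.shape)         # torch.Size([4, 768])
```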
Training
Each vec2vec model was trained to enable unsupervised translation of embeddings from one vector space to another while preserving their semantic and geometric structure. The research team selected two different sets of text data, each containing 1 million 64-token sequences sampled from the Natural Questions (NQ) dataset, with no overlap between the sets.
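As an aside, fixed-length chunks like these could be produced with a Hugging Face tokenizer as in the sketch below; the tokenizer choice and preprocessing details are illustrative assumptions, not the paper's exact pipeline:

```python
from transformers import AutoTokenizer

# Illustrative tokenizer; the paper's exact preprocessing is not reproduced here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_into_sequences(texts, seq_len=64):
    """Tokenize each document and cut its token stream into fixed-length chunks."""
    chunks = []
    for text in texts:
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        for start in range(0, len(ids) - seq_len + 1, seq_len):
            chunks.append(tokenizer.decode(ids[start:start + seq_len]))
    return chunks

docs = ["A long document sampled from the Natural Questions corpus ..."]
print(len(chunk_into_sequences(docs)))
```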
The embedding models used during training and evaluation are listed in the table below (granite is multilingual, while CLIP is multimodal).
Model | Params (M) | Backbone | Year | Embedding dim | Max sequence length |
---|---|---|---|---|---|
gtr | 110 | T5 | 2021 | 768 | 512 |
clip | 151 | CLIP | 2021 | 512 | 77 |
e5 | 109 | BERT | 2022 | 768 | 512 |
gte | 109 | BERT | 2023 | 768 | 512 |
stella | 109 | BERT | 2023 | 768 | 512 |
granite | 278 | RoBERTa | 2024 | 768 | 512 |
The approach involves learning two mappings: one from a source embedding space to the universal latent space, and another from the universal latent space to the target embedding space. For example, one vec2vec model might learn to convert e5 embeddings to gte, while another handles gtr to stella, and so on, covering all pairs among e5, gte, gtr, stella, granite, and clip.
This is done in an unsupervised manner, without any explicit pairing or alignment: training relies only on the two unpaired sets of embeddings, using objectives such as adversarial distribution matching and cycle consistency. The result is that translated embeddings achieve high cosine similarity with their true counterparts in the target space, showing that semantic meaning is preserved.
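The sketch below shows, in highly simplified form, what one unsupervised training step of this flavor could look like: two toy translator networks trained with a cycle-consistency objective on unpaired batches. The architecture, the single loss term, and the omission of the adversarial components are all simplifying assumptions; this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_translator(d_in=768, d_out=768, hidden=1024):
    """Tiny stand-in translator network (one embedding space -> the other)."""
    return nn.Sequential(nn.Linear(d_in, hidden), nn.SiLU(), nn.Linear(hidden, d_out))

f12 = make_translator()  # space 1 -> space 2
f21 = make_translator()  # space 2 -> space 1
opt = torch.optim.Adam(list(f12.parameters()) + list(f21.parameters()), lr=1e-4)

# Unpaired embedding batches from two different models (random stand-ins here).
u1_batch = torch.randn(32, 768)
u2_batch = torch.randn(32, 768)

# Cycle consistency: translating into the other space and back should
# recover the original embedding, with no paired supervision required.
u1_cycled = f21(f12(u1_batch))
u2_cycled = f12(f21(u2_batch))
loss = (1 - F.cosine_similarity(u1_batch, u1_cycled, dim=-1).mean()) \
     + (1 - F.cosine_similarity(u2_batch, u2_cycled, dim=-1).mean())

# A full setup would add adversarial / distribution-matching terms so that
# f12(u1_batch) becomes indistinguishable from genuine space-2 embeddings.
opt.zero_grad()
loss.backward()
opt.step()
```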
Evaluation
They tested how much information could be extracted from translated text embeddings using two techniques: zero-shot attribute inference, which identifies the top k most similar attributes based on cosine similarity, and embedding inversion, which attempts to reconstruct the original text from the embedding.
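For the zero-shot attribute inference part, the following toy sketch (random stand-in vectors and invented attribute names) shows the top-k ranking by cosine similarity; in the actual evaluation, both the attributes and the query would be embedded by real models:

```python
import numpy as np

def top_k_attributes(query_emb, attribute_embs, attribute_names, k=3):
    """Rank candidate attributes by cosine similarity to a (translated) embedding
    and return the top k matches with their scores."""
    q = query_emb / np.linalg.norm(query_emb)
    A = attribute_embs / np.linalg.norm(attribute_embs, axis=1, keepdims=True)
    scores = A @ q
    order = np.argsort(-scores)[:k]
    return [(attribute_names[i], float(scores[i])) for i in order]

# Toy example: 5 candidate attributes with random stand-in embeddings.
rng = np.random.default_rng(0)
names = ["medicine", "finance", "sports", "law", "travel"]
attrs = rng.normal(size=(5, 768))
query = rng.normal(size=768)   # in practice: a vec2vec-translated embedding
print(top_k_attributes(query, attrs, names, k=3))
```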
The three heatmaps below compare the embedding similarity scores across different text embedding models, before and after alignment using vec2vec.

- Similarity of Inputs (left panel) shows the original cosine similarity between embeddings from two different models, before any transformation. For example, e5 vs gte has a similarity of 0.68, while granite vs gtr is -0.02.
- Similarity of latents (middle panel) shows the cosine similarity after alignment using vec2vec. Across models, most values are now very high (e.g., 0.88–0.96).
- Difference in similarities (right panel) shows the difference between the latent similarity and input similarity, where high values (red) mean big improvements in similarity after alignment.
The heatmaps show that vec2vec significantly increases similarity between semantically equivalent embeddings across different models.
Example of using vec2vec translations to extract information from an email
The researchers wanted to determine whether translating and inverting an email embedding (i.e., turning the vector back into text) could reveal any private or meaningful information from the original email, even without direct access to the original content.
They used GPT-4o as a judge. The model is shown both the original and the inverted email (the text reconstructed from the embedding). It is then asked to decide whether the inverted version leaks or reveals information about the original.
Here is the exact prompt they gave GPT-4o:

The figure below shows how often GPT-4o found that meaningful information was leaked. This illustrates how much sensitive or semantic content remains in the embedding after translation.

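A minimal sketch of such an LLM-as-a-judge check, assuming the openai Python client and an illustrative prompt (the authors' exact prompt is the one shown in the figure above), might look like this:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_leakage(original_email: str, inverted_email: str) -> str:
    """Ask an LLM judge whether the inverted text leaks information from the original.
    The prompt below is an illustrative paraphrase, not the authors' exact prompt."""
    prompt = (
        "You are given an original email and a reconstruction obtained by "
        "inverting a translated embedding.\n\n"
        f"Original email:\n{original_email}\n\n"
        f"Reconstructed email:\n{inverted_email}\n\n"
        "Does the reconstruction leak or reveal meaningful information about "
        "the original email? Answer 'yes' or 'no' and briefly explain."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example call (the strings are placeholders):
# print(judge_leakage("Hi Bob, the Q3 numbers are ...", "bob q3 numbers meeting ..."))
```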
Security risks related to text input reconstruction
Being able to translate embeddings between spaces without having the original data can create security risks for vector databases. If someone gains access to stored embeddings, they may be able to recover sensitive information by translating and inverting those vectors, even if they never see the original text. This means private data such as medical records, personal queries, or proprietary documents could potentially be recovered from embeddings alone.
To mitigate this risk, embeddings should be treated as sensitive data. Measures may include encrypting vector databases, restricting access to tools that export embeddings, and preventing unauthorized translation between embedding models.
Conclusion
This research suggests that embedding spaces are not arbitrary, encoder-specific artifacts: they share a structure in which a universal geometry might underlie all learned representations.
The approach has major implications: it means that models can understand each other without direct supervision, and embeddings can be reused across tasks and domains more effectively than previously thought.
References
Paper on arXiv: “Harnessing the Universal Geometry of Embeddings”