Multi-Vector Retrieval Details with Mixedbread’s Aamir Shakir
Most RAG systems have a context problem. I talked with Aamir Shakir, the founder of Mixedbread, for a deep dive into the research and engineering behind modern retrieval systems. I’ve been using Mixedbread’s tools, particularly mgrep, and I’m impressed. They claim to cut tokens in half, speed up retrieval, and improve quality. After running my own experiments, I found they were right.
Our conversation went beyond the usual “advanced RAG” topics of hybrid search and re-rankers; we explored the architectural shift toward multi-vector retrieval.
This post is a summary of our discussion, covering the theory, the engineering challenges, and the practical applications of building a state-of-the-art retrieval system.
If you missed the first talk on using mgrep for agentic workflows, you can watch it here: mgrep with Founding Engineer Rui.
What is Mixedbread & Multi-Vector Search?
Mixedbread began as an applied research lab built on a simple hypothesis: AI will only be as useful as its context. An AI without context is like a new employee on their first day; they’re capable but not yet effective. This “context problem” is a search and retrieval problem.
While AI models have advanced at a breakneck pace, retrieval technology has largely relied on concepts from 20 years ago. Mixedbread’s goal is to modernize retrieval to match the capabilities of today’s AI.
The core of their approach is multi-vector search.
Traditional retrieval augmented generation (RAG) typically follows this path:
Take a document.
Split it into chunks.
Create one single vector embedding for each chunk.
This process involves extreme compression, forcing a lot of complex information into a single vector. Multi-vector search, particularly using models like ColBERT, changes this.
Instead of one vector per chunk, it creates one vector per token. For a sentence like “I love bread,” a traditional model produces one vector. A multi-vector model produces three: one for “I,” one for “love,” and one for “bread.” This preserves far more granular information.
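To make the distinction concrete, here is a minimal sketch using a generic BERT-style encoder from Hugging Face (not Mixedbread’s actual model; ColBERT-style models also add a small projection layer, omitted here) that contrasts a single pooled vector with per-token vectors:

```python
# Single-vector vs. multi-vector representations of the same text.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example model
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I love bread", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, num_tokens, 768)

# Multi-vector: keep one embedding per token (special tokens included here).
token_vectors = hidden[0]                         # (num_tokens, 768)

# Single-vector: pool everything into one embedding for the whole text.
single_vector = hidden.mean(dim=1)[0]             # (768,)

print(token_vectors.shape, single_vector.shape)
```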
Why Multi-Vector Outperforms Traditional RAG
The limitations of older methods highlight why a new approach is needed.
Keyword Search (BM25) anchors on exact keywords, making it robust for niche domains with specific terminology (like chemical names in biology). However, it fails on semantics: synonyms and abbreviations (like “RAG” vs. “retrieval augmented generation”) and context (“Apple” the fruit vs. “Apple” the company).
Single-Vector Search compresses paragraphs into a single point. It captures the topic but blurs the nuance. When you compress a complex paragraph discussing politics, food, and sports into a single vector, you may only retain the main topic and lose the details. This approach is also sensitive to “out-of-distribution” data. If the model hasn’t seen a specific term or if OCR errors introduce strange characters, it will guess where to place the vector, losing all meaning.
Multi-vector search combines the best of both worlds.
Granularity: By representing every token, it captures the keyword-level precision of BM25, making it robust to out-of-distribution terms.
Semantics: Since each token’s representation is a dense vector, it also captures the semantic meaning and context, just like a traditional embedding.
This approach effectively provides a powerful hybrid search “out of the box.” Because it retains so much more information, it generalizes exceptionally well to new domains, complex data, and even long-context retrieval.
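As an illustration, here is a small sketch of the ColBERT-style “late interaction” (MaxSim) scoring that multi-vector retrieval typically uses. It assumes you already have per-token embedding matrices for the query and the document, such as those produced in the snippet above:

```python
# MaxSim: for every query token, take its best-matching document token,
# then sum those maxima into a single relevance score.
import torch

def maxsim_score(q_vecs: torch.Tensor, d_vecs: torch.Tensor) -> torch.Tensor:
    """q_vecs: (num_query_tokens, dim), d_vecs: (num_doc_tokens, dim)."""
    q = torch.nn.functional.normalize(q_vecs, dim=-1)
    d = torch.nn.functional.normalize(d_vecs, dim=-1)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()   # best doc token per query token, summed

# Rank documents by their MaxSim score against the query:
# scores = [maxsim_score(query_vecs, doc_vecs) for doc_vecs in doc_vec_list]
```

Because every query token gets to “look for” its best match, exact terms behave much like keywords while still being matched in embedding space.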
Aamir shared benchmarks showing their ColBERT-style model, trained only on documents with 300 tokens, outperforming models specifically designed for long-context retrieval on documents with tens of thousands of tokens.
To build a strong foundation in traditional RAG, including BM25 and semantic search, check out my course. All the content is free to access.
Making Multi-Vector Practical with Quantization
If multi-vector is so powerful, why wasn’t it the standard all along? The primary barriers were infrastructure and cost. Storing a vector for every token generates a massive amount of data, making it prohibitively expensive and slow without the right engineering.
This is where quantization becomes critical. Quantization is the process of converting high-precision numbers (like 32-bit floats) into lower-precision formats to save storage and speed up computation.
Aamir explained two common techniques:
Int8 Quantization: Instead of storing a 32-bit float for each dimension of a vector, you can represent it with an 8-bit integer. This involves finding the minimum and maximum value for each dimension across a sample of your data, then dividing that range into 256 “buckets.” Each float is then mapped to the integer of its corresponding bucket. This reduces storage by 4x and can speed up computation by 8-10x with virtually zero loss in retrieval quality.
Binary (1-bit) Quantization: This is a more extreme form of compression. For each dimension, you simply store a 1 if the value is positive and a 0 if it’s not. This reduces storage by 32x. Instead of calculating cosine similarity, you can use Hamming distance, which is incredibly fast (essentially two CPU instructions: XOR and popcount). However, this can lead to significant performance loss if the model isn’t optimized for it.
Mixedbread discovered a trick to mitigate the performance loss of binary quantization. If the stored document vectors are binary but the incoming query vector remains in a higher precision (like float32 or int8), the performance loss drops from as much as 40% down to just 5%. The model cares more about the dimensionality and the precision of the query than having both be low-precision.
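To make the mechanics concrete, here is a numpy sketch of both schemes and of the asymmetric (float query vs. binary documents) scoring trick. It illustrates the idea, not Mixedbread’s actual implementation:

```python
import numpy as np

def int8_quantize(vectors: np.ndarray, calib: np.ndarray):
    """Map float32 vectors into 256 buckets per dimension using a calibration sample."""
    lo, hi = calib.min(axis=0), calib.max(axis=0)
    scale = np.maximum(hi - lo, 1e-8) / 255.0
    q = np.clip(np.round((vectors - lo) / scale), 0, 255) - 128
    return q.astype(np.int8), lo, scale          # keep lo/scale to dequantize later

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """1 bit per dimension: 1 if positive, 0 otherwise, packed into bytes."""
    return np.packbits(vectors > 0, axis=-1)

def asymmetric_score(query: np.ndarray, binary_docs: np.ndarray) -> np.ndarray:
    """Keep the query in float32 and score it against unpacked binary documents.
    This is what recovers most of the quality lost to 1-bit document vectors."""
    doc_bits = np.unpackbits(binary_docs, axis=-1)[:, : query.shape[-1]]
    doc_signs = doc_bits.astype(np.float32) * 2.0 - 1.0   # {0,1} -> {-1,+1}
    return doc_signs @ query
```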
Mixedbread and Hugging Face co-authored a post on this topic, showing how to achieve a 40x speedup and 62x cost reduction.
I also wrote a post that breaks down the fundamentals of quantization for multi-vector retrieval.
Mixedbread’s Architecture & Semantic Chunking
With these techniques, Mixedbread has built an end-to-end system designed for speed and scale. Indexing the entire React codebase (60 million tokens) takes just a couple of minutes.
Here’s an overview of their architecture:
Ingestion & Chunking: When a file is uploaded, it’s chunked based on semantics (more on this later).
Inference: The chunks are sent to a fleet of GPUs running a custom, highly-optimized inference engine with custom CUDA kernels. This allows for massive parallelization and low-latency embedding generation.
Storage & Caching: Embeddings are quantized and stored. The system uses a two-step retrieval process (a fast, lossy first pass followed by a full-precision second pass) and a multi-tier caching system that moves data from S3 to hard drives, NVMe SSDs, and finally to in-memory for frequently accessed data.
A query typically takes around 60 milliseconds end-to-end (P95).
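Here is a rough, self-contained sketch of what such a two-step retrieval pass can look like: a Hamming-distance shortlist over binary codes, followed by a full-precision rescore of the survivors. This is illustrative only, not Mixedbread’s code:

```python
import numpy as np

def hamming_search(query_bits: np.ndarray, doc_bits: np.ndarray, top_k: int) -> np.ndarray:
    """First pass: Hamming distance over packed binary codes (XOR + popcount)."""
    distances = np.unpackbits(np.bitwise_xor(doc_bits, query_bits), axis=-1).sum(axis=-1)
    return np.argsort(distances)[:top_k]

def two_step_retrieve(query_f32, doc_f32, doc_bits, shortlist=200, final_k=10):
    query_bits = np.packbits(query_f32 > 0)                       # 1-bit code for the query
    candidates = hamming_search(query_bits, doc_bits, shortlist)  # fast, lossy pass
    exact_scores = doc_f32[candidates] @ query_f32                # full-precision rescore
    return candidates[np.argsort(-exact_scores)][:final_k]
```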
Smart Chunking for Any Data Type
A key part of Mixedbread’s system is its sophisticated approach to parsing and chunking different file types, so users don’t need a Ph.D. in data processing.
Code: They parse the Abstract Syntax Tree (AST) to create semantically meaningful chunks, grouping related functions or classes together (see the sketch after this list).
PDFs: PDFs are notoriously difficult to parse due to tables, columns, and charts. Mixedbread bypasses this by taking a screenshot of each page and creating an embedding from the image. This perfectly preserves the layout and content. They also use LLMs to create contextual summaries to link pages together.
Video: A transformer-based shot detection model analyzes frames to identify scene changes, creating logical chunks based on the visual narrative.
Text/Markdown: They use contextualization methods to ensure each chunk contains relevant surrounding information, a technique inspired by research from Anthropic.
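As a concrete, deliberately simplified example of the code case above, here is a sketch of AST-based chunking for Python files using only the standard library. Production systems use multi-language parsers (e.g. tree-sitter) and smarter grouping:

```python
# Split a Python source file into one chunk per top-level function or class.
import ast

def chunk_python_source(source: str) -> list[str]:
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # ast.get_source_segment recovers the exact source text of the node.
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```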
This idea of processing entire documents and then chunking at the embedding level is sometimes called “late chunking.”
I wrote a post that covers the concept of late chunking with a minimal implementation.
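For reference, a minimal late-chunking sketch looks like this: encode the whole document once, then mean-pool the token embeddings inside each chunk’s token span, so every chunk vector is conditioned on full-document context. The model name and the span format are placeholders:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # example model
model = AutoModel.from_pretrained("bert-base-uncased")

def late_chunk(document: str, spans: list[tuple[int, int]]) -> torch.Tensor:
    """spans are (start, end) token-index ranges chosen by your chunker."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_embs = model(**inputs).last_hidden_state[0]   # (num_tokens, dim)
    # One pooled vector per chunk, each computed from document-aware token embeddings.
    return torch.stack([token_embs[start:end].mean(dim=0) for start, end in spans])
```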
The Role of Re-rankers & Cross-Encoders
Even with a strong retriever like ColBERT, a re-ranking step can provide a final boost in quality. Aamir confirmed they use cross-encoders internally.
A cross-encoder looks at both the query and a potential document simultaneously (pairwise comparison), allowing it to make a more accurate judgment of relevance than a retriever that embeds the document in isolation.
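For example, a basic cross-encoder re-ranking step with the sentence-transformers library might look like this (the checkpoint is a common public one, not necessarily what Mixedbread runs internally):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how does multi-vector retrieval work?"
candidates = [
    "ColBERT stores one vector per token and scores with late interaction.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
]

# Score each (query, document) pair jointly, then sort by relevance.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```

In practice you would only re-rank the handful of candidates returned by the retriever, keeping the expensive pairwise step off the hot path.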
The next frontier is list-wise re-ranking, where the model is given the query and the entire list of candidate documents at once. This allows it to answer questions like “which is the fastest?” by comparing all options. However, this is currently too slow and expensive for most production systems.
Aamir is also excited about learnable scoring functions. Instead of burning GPUs to create complex embeddings only to compare them with a simple cosine similarity, the scoring function itself could be a learned model, further personalizing and improving relevance.
How to Get Started in Retrieval Research
Retrieval is a more accessible field to enter than training foundational LLMs. You can get started with just a MacBook. Aamir’s advice for anyone interested was:
Read the Fundamentals: Start with the original SBERT (Sentence-BERT) paper to understand the basics of modern embedding models.
Learn by Doing: Use libraries like sentence-transformers to train your own models (see the sketch after this list). The documentation is excellent.
Read In-Depth Guides: The Mixedbread blog offers deep dives into their training techniques.
Stay Updated: Follow resources like the Information Retrieval Substack to keep up with the latest research.
Embrace the Struggle: The most important step is to build things yourself. Don’t rely on AI to write all the code. The learning happens when you’re debugging PyTorch and CUDA errors.
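If you want to try the “learn by doing” step, a minimal fine-tuning sketch with sentence-transformers could look like this. The base model and the toy pairs are placeholders; the library docs cover proper training recipes (hard negatives, evaluation, and so on):

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

train_examples = [
    InputExample(texts=["what is multi-vector retrieval?",
                        "ColBERT represents every token with its own vector."]),
    InputExample(texts=["how does BM25 work?",
                        "BM25 scores documents using term frequency and IDF."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: the other examples in each batch act as negatives.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, loss)], epochs=1, warmup_steps=10)
```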
Conclusion
Multi-vector search is heavier and harder to engineer than standard RAG. But with quantization making it affordable, the quality gains are finally accessible. If you are hitting a ceiling with semantic search, this is the architecture to investigate next; Mixedbread offers an API that handles it for you.
My conversation with Aamir reinforced that the quality of our AI systems will always be tied to the quality of the context we provide. As models get smarter, the tools we use to feed them information must get smarter too.
If you’re working on complex retrieval problems, the techniques discussed here are the new baseline. And if you’re an engineer passionate about building high-performance, distributed systems, Mixedbread is hiring.
Explore their work and open positions at mixedbread.com.

