Deep dive into bi-encoder and cross-encoder architectures for semantic similarity. Learn the trade-offs, implementation patterns, and when to use each approach in RAG systems and search applications.
When building semantic search, RAG systems, or recommendation engines, one architectural decision will fundamentally shape your system's performance: should you use bi-encoders or cross-encoders? The answer isn't straightforward—each architecture makes different trade-offs between speed and accuracy that matter enormously at scale.
Bi-encoders can search through millions of documents in milliseconds but may miss nuanced relevance. Cross-encoders capture subtle semantic relationships with remarkable accuracy but can't scale beyond a few hundred comparisons per query. Understanding when to use each—and how to combine them—is essential for building production-grade semantic systems.
This guide explains both architectures from first principles, compares their characteristics, and shows you how to implement the two-stage retrieval pattern that underpins most modern production search and RAG systems.
Traditional keyword search fails when users express the same concept differently. A search for "how to fix a slow laptop" won't match a document titled "Speed up your computer performance" even though both express essentially the same intent. Semantic search solves this by comparing meaning rather than words.
But here's the challenge: to find semantically similar documents, you need to compare your query against every document in your corpus. With millions of documents, this becomes computationally intractable—unless you're clever about how you structure the comparison.
This is where bi-encoders and cross-encoders diverge. They represent two fundamentally different approaches to the same problem:
"Encode everything once, compare embeddings fast"
"Consider query and document together for precision"
A bi-encoder uses two separate transformer encoders (or the same encoder applied twice) to independently convert queries and documents into fixed-size embedding vectors. These embeddings exist in a shared semantic space where similar meanings cluster together.
Query: "laptop running slow" Document: "Speed up your computer"
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Transformer │ │ Transformer │
│ Encoder │ │ Encoder │
└────────┬────────┘ └────────┬────────┘
│ │
▼ ▼
[0.23, -0.45, 0.12, ...] [0.21, -0.42, 0.15, ...]
Query Embedding Document Embedding
│ │
└──────────────┬───────────────────────┘
│
▼
Cosine Similarity
0.94
The key insight is that document embeddings can be computed once and stored. When a query arrives, you only need to:

1. Encode the query (a single forward pass).
2. Compare the query embedding against the stored document embeddings.
3. Return the closest matches.
With optimised libraries like FAISS or vector databases like Pinecone, you can compare against billions of vectors in under 100 milliseconds. This is why bi-encoders dominate large-scale retrieval.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a bi-encoder model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Pre-compute document embeddings (do this once)
documents = [
    "Speed up your computer performance with these tips",
    "Best practices for Python code optimization",
    "How to troubleshoot network connectivity issues",
    "Machine learning model deployment strategies",
]

# Encode all documents - these embeddings are stored/cached
doc_embeddings = model.encode(documents, convert_to_numpy=True)

# At query time, encode the query and compare
query = "laptop running slow"
query_embedding = model.encode(query, convert_to_numpy=True)

# Compute cosine similarities
similarities = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Get top results
top_indices = np.argsort(similarities)[::-1]
for idx in top_indices[:3]:
    print(f"Score: {similarities[idx]:.3f} | {documents[idx]}")
```

Because bi-encoders compress documents into fixed-size vectors (typically 384-768 dimensions), information is necessarily lost. How much that loss hurts depends on the model's capacity, the embedding dimensionality, and how closely your documents resemble the data the model was trained on.
A bi-encoder must compress an entire document (potentially thousands of words) into a single vector of a few hundred numbers. This works well for capturing general topic similarity but can miss specific details that matter for relevance. A document about "Python code optimisation" and "Python snake habitats" might have more similar embeddings than you'd expect because "Python" dominates both.
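At larger scale, the brute-force NumPy comparison above is typically replaced by a vector index such as FAISS, as mentioned earlier. Here is a minimal sketch of that swap, assuming the faiss-cpu package is installed and reusing documents, doc_embeddings, and query_embedding from the bi-encoder example; the variable names and top-3 cutoff are illustrative only:

```python
import faiss  # pip install faiss-cpu
import numpy as np

# Reuse documents, doc_embeddings and query_embedding from the example above.
# Normalising the vectors makes the inner product equal to cosine similarity.
doc_vectors = doc_embeddings.astype('float32', copy=True)
faiss.normalize_L2(doc_vectors)

index = faiss.IndexFlatIP(doc_vectors.shape[1])  # exact inner-product index
index.add(doc_vectors)                           # add all document vectors once

query_vector = query_embedding.astype('float32', copy=True).reshape(1, -1)
faiss.normalize_L2(query_vector)

scores, indices = index.search(query_vector, 3)  # top-3 nearest documents
for score, idx in zip(scores[0], indices[0]):
    print(f"Score: {score:.3f} | {documents[idx]}")
```

For corpora in the millions you would swap IndexFlatIP for an approximate index such as IVF or HNSW, but the calling pattern stays the same.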
Cross-encoders take a fundamentally different approach. Instead of encoding query and document separately, they process both together as a single input sequence. This allows full attention between query tokens and document tokens—the model can directly compare every word in the query against every word in the document.
Query: "laptop running slow" Document: "Speed up your computer"
│ │
└──────────┬───────────────┘
│
▼
[CLS] laptop running slow [SEP] Speed up your computer [SEP]
│
▼
┌─────────────────────┐
│ Transformer │
│ (Full Attention │
│ Between All │
│ Tokens) │
└──────────┬──────────┘
│
▼
┌─────────────────┐
│ Classification │
│ Head │
└────────┬────────┘
│
▼
Relevance Score: 0.89
The key advantage is cross-attention. When processing the combined input, the transformer can match individual query terms against related document terms, use the query to disambiguate words in the document (and vice versa), and weigh which parts of the document actually address the question.
This produces more nuanced relevance judgments. Cross-encoders consistently outperform bi-encoders on relevance benchmarks, often by significant margins (5-15% improvement in metrics like NDCG@10).
```python
from sentence_transformers import CrossEncoder

# Load a cross-encoder model trained for relevance ranking
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score query-document pairs
query = "laptop running slow"
documents = [
    "Speed up your computer performance with these tips",
    "Best practices for Python code optimization",
    "How to troubleshoot network connectivity issues",
    "Machine learning model deployment strategies",
]

# Create query-document pairs
pairs = [[query, doc] for doc in documents]

# Score all pairs (returns relevance scores)
scores = model.predict(pairs)

# Sort by score
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"Score: {score:.3f} | {doc}")
```

Cross-encoders have a fatal flaw for large-scale retrieval: you cannot pre-compute anything. Every query requires processing the query together with every document through the full transformer. For a corpus of 1 million documents, that means one million forward passes per query; even at a millisecond per pair, a single search would take roughly a quarter of an hour of compute.
This is obviously impractical for real-time search.
Cross-encoders can realistically only score hundreds to low thousands of candidates per query. This limitation is fundamental to the architecture—there's no way around it without sacrificing the cross-attention that makes them accurate.
Let's compare both architectures across the dimensions that matter for production systems:
| Characteristic | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Speed at Query Time | Very fast (milliseconds) | Slow (scales linearly with corpus) |
| Relevance Accuracy | Good | Excellent |
| Scalability | Billions of documents | Hundreds per query |
| Pre-computation | Yes - encode docs once | No - must process each query |
| Index Updates | Add new embeddings easily | No index needed |
| Memory for Corpus | Embedding storage required | Just document text |
| GPU Requirements | Query-time only (optional) | Required for reasonable speed |
| Best For | First-stage retrieval, large corpora | Reranking, high-stakes decisions |
The performance difference isn't marginal. On standard benchmarks like MS MARCO, a cross-encoder such as ms-marco-MiniLM-L-6-v2 achieves roughly 18% better ranking quality than a bi-encoder such as all-MiniLM-L6-v2, but at a cost that makes it unusable for first-stage retrieval at scale.
The solution used by virtually every production semantic search system is two-stage retrieval: use a bi-encoder to quickly retrieve candidates, then use a cross-encoder to precisely rerank the top results.
User Query
│
▼
┌───────────────────────────────┐
│ Stage 1: Retrieval │
│ (Bi-Encoder) │
│ │
│ • Encode query (~20ms) │
│ • Search vector index (~50ms) │
│ • Return top 100 candidates │
└───────────────┬───────────────┘
│
Top 100 docs
│
▼
┌───────────────────────────────┐
│ Stage 2: Reranking │
│ (Cross-Encoder) │
│ │
│ • Score 100 pairs (~500ms) │
│ • Sort by relevance │
│ • Return top 10 │
└───────────────┬───────────────┘
│
▼
Final Results
This approach captures most of the cross-encoder's accuracy improvement while maintaining millisecond-scale latency. The bi-encoder's job is recall (don't miss relevant documents), while the cross-encoder's job is precision (rank the relevant ones correctly).
```python
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from typing import List, Optional, Tuple

class TwoStageRetriever:
    def __init__(
        self,
        bi_encoder_model: str = 'all-MiniLM-L6-v2',
        cross_encoder_model: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2',
        top_k_retrieval: int = 100,
        top_k_rerank: int = 10
    ):
        self.bi_encoder = SentenceTransformer(bi_encoder_model)
        self.cross_encoder = CrossEncoder(cross_encoder_model)
        self.top_k_retrieval = top_k_retrieval
        self.top_k_rerank = top_k_rerank

        self.documents: List[str] = []
        self.doc_embeddings: Optional[np.ndarray] = None

    def index_documents(self, documents: List[str]) -> None:
        """Pre-compute and store document embeddings."""
        self.documents = documents
        self.doc_embeddings = self.bi_encoder.encode(
            documents,
            convert_to_numpy=True,
            show_progress_bar=True
        )
        # Normalize for cosine similarity
        self.doc_embeddings = self.doc_embeddings / np.linalg.norm(
            self.doc_embeddings, axis=1, keepdims=True
        )

    def search(self, query: str) -> List[Tuple[str, float]]:
        """Two-stage search: retrieve then rerank."""
        # Stage 1: Bi-encoder retrieval
        query_embedding = self.bi_encoder.encode(query, convert_to_numpy=True)
        query_embedding = query_embedding / np.linalg.norm(query_embedding)

        # Compute similarities (dot product of normalized vectors = cosine)
        similarities = np.dot(self.doc_embeddings, query_embedding)

        # Get top-k candidates
        top_indices = np.argsort(similarities)[::-1][:self.top_k_retrieval]
        candidates = [self.documents[i] for i in top_indices]

        # Stage 2: Cross-encoder reranking
        pairs = [[query, doc] for doc in candidates]
        rerank_scores = self.cross_encoder.predict(pairs)

        # Sort by rerank scores
        reranked = sorted(
            zip(candidates, rerank_scores),
            key=lambda x: x[1],
            reverse=True
        )

        return reranked[:self.top_k_rerank]


# Usage example
retriever = TwoStageRetriever()

# Index your documents (do once)
documents = [
    "How to improve laptop performance and speed",
    "Python programming best practices guide",
    "Troubleshooting slow computer issues",
    "Machine learning model optimization techniques",
    "Windows performance tuning tips",
    # ... thousands more documents
]
retriever.index_documents(documents)

# Search (fast, accurate)
results = retriever.search("my laptop is running slowly")
for doc, score in results:
    print(f"{score:.3f}: {doc}")
```

The key parameters to tune are `top_k_retrieval` (how many candidates the bi-encoder hands to the cross-encoder) and `top_k_rerank` (how many results you return), along with the choice of model for each stage.
For a 200ms total latency budget, a workable split is to spend a few tens of milliseconds on query encoding and the vector search, reserve most of what remains for batched cross-encoder scoring (ideally on a GPU), and keep some headroom for network latency. That typically still allows reranking around 100 candidates while staying responsive.
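To sanity-check a budget like that, the simplest approach is to time the retriever end to end. A rough sketch below reuses the TwoStageRetriever and retriever objects from the earlier example; the measure_latency helper and the 200ms threshold are illustrative, not part of any library:

```python
import time

def measure_latency(retriever: TwoStageRetriever, query: str, budget_ms: float = 200.0):
    """Time one two-stage search and compare it against a latency budget."""
    start = time.perf_counter()
    results = retriever.search(query)
    elapsed_ms = (time.perf_counter() - start) * 1000
    status = "within" if elapsed_ms <= budget_ms else "over"
    print(f"search took {elapsed_ms:.1f} ms ({status} the {budget_ms:.0f} ms budget)")
    return results

measure_latency(retriever, "my laptop is running slowly")
```

If the measurement comes in over budget, the first knobs to turn are top_k_retrieval and the batch size passed to the cross-encoder's predict call.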
Understanding when each architecture shines helps you make the right choice for your specific application.
- **Customer-facing search** (e-commerce, documentation, knowledge bases): the two-stage pattern is essential. Users expect sub-second responses, but they also expect relevant results.
- **RAG systems**: the quality of retrieved context directly impacts the quality of generated responses, so cross-encoder reranking is particularly valuable here.
- **Deduplication**: finding duplicate or near-duplicate content across large document sets. Bi-encoders often suffice here because you're looking for high similarity rather than subtle relevance.
- **Recommendation engines**: content-based recommendations using semantic similarity. Bi-encoders excel because you need to compare user preferences against large item catalogs in real time.
Not every use case needs two-stage retrieval. Consider a bi-encoder alone when your corpus is small, your latency budget is extremely tight, or you only need coarse similarity matching (deduplication, clustering, recommendations) rather than fine-grained relevance ranking.
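As a concrete example of the bi-encoder-only case, near-duplicate detection needs nothing more than embeddings and a similarity threshold. A minimal sketch is below; the 0.85 cutoff and the sample texts are illustrative choices you would tune on your own data:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

texts = [
    "How to speed up a slow laptop",
    "Ways to make a sluggish laptop faster",
    "Best hiking trails in the Alps",
]

# Encode and normalise so dot products are cosine similarities
embeddings = model.encode(texts, convert_to_numpy=True)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Pairwise similarity matrix; flag pairs above the threshold as likely duplicates
similarity = embeddings @ embeddings.T
threshold = 0.85
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        if similarity[i, j] >= threshold:
            print(f"Possible duplicates ({similarity[i, j]:.2f}): {texts[i]!r} / {texts[j]!r}")
```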
Use this decision framework to select the right approach for your use case:
- **Corpus size > 10,000 documents?** Yes → you need a bi-encoder for first-stage retrieval.
- **Real-time latency requirements (< 500ms)?** Yes → use two-stage retrieval with a limited number of reranking candidates.
- **High-stakes decisions (legal, medical, financial)?** Yes → definitely add cross-encoder reranking.
- **Simple similarity matching (duplicates, recommendations)?** A bi-encoder alone may be sufficient.
Choosing the right pre-trained models significantly impacts performance:
| Use Case | Recommended Bi-Encoder | Recommended Cross-Encoder |
|---|---|---|
| General English | all-MiniLM-L6-v2 | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| Higher Accuracy | all-mpnet-base-v2 | cross-encoder/ms-marco-electra-base |
| Multilingual | paraphrase-multilingual-MiniLM-L12-v2 | cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 |
| Long Documents | BAAI/bge-large-en-v1.5 | BAAI/bge-reranker-large |
| Maximum Quality | intfloat/e5-large-v2 | cross-encoder/stsb-roberta-large |
Pre-trained models work well for general use, but fine-tuning on your domain data typically improves performance by 10-20%. This is especially true for specialised domains like legal, medical, or technical content. Both bi-encoders and cross-encoders can be fine-tuned using contrastive learning on query-document pairs.
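For bi-encoders, that fine-tuning is commonly done with the sentence-transformers training loop and an in-batch-negatives objective such as MultipleNegativesRankingLoss. A rough sketch is below; the two training pairs are placeholders standing in for thousands of real query-document pairs from your domain, and the output path is arbitrary:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('all-MiniLM-L6-v2')

# Placeholder (query, relevant document) pairs - replace with your domain data
train_examples = [
    InputExample(texts=["laptop running slow",
                        "Speed up your computer performance with these tips"]),
    InputExample(texts=["python code is slow",
                        "Best practices for Python code optimization"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Contrastive objective: other documents in the batch act as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path='fine-tuned-bi-encoder',
)
```

Cross-encoders can be fine-tuned along similar lines via the CrossEncoder class, but on labelled query-document pairs with relevance scores rather than positive pairs alone.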
In production, you'll typically store bi-encoder embeddings in a vector database. Here's how the pattern works with popular options:
```python
import pinecone
from sentence_transformers import SentenceTransformer, CrossEncoder

# Initialize
pinecone.init(api_key="your-api-key", environment="your-env")
index = pinecone.Index("semantic-search")

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def two_stage_search(query: str, top_k: int = 10) -> list:
    # Stage 1: Vector search with Pinecone
    query_embedding = bi_encoder.encode(query).tolist()

    results = index.query(
        vector=query_embedding,
        top_k=100,  # Retrieve more for reranking
        include_metadata=True
    )

    # Extract documents from results
    candidates = [
        (match.id, match.metadata['text'])
        for match in results.matches
    ]

    # Stage 2: Cross-encoder reranking
    pairs = [[query, doc] for _, doc in candidates]
    scores = cross_encoder.predict(pairs)

    # Combine IDs with reranked scores
    reranked = sorted(
        zip([c[0] for c in candidates], [c[1] for c in candidates], scores),
        key=lambda x: x[2],
        reverse=True
    )

    return reranked[:top_k]
```

The same pattern works with other vector databases like Qdrant, Weaviate, Milvus, or pgvector. The bi-encoder handles the initial retrieval from the vector index, and the cross-encoder refines the ranking.
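For instance, here is roughly what the same two stages look like against Qdrant. This is a sketch under assumptions: it presumes a collection named "semantic-search" whose payloads carry a "text" field, uses the qdrant-client package's search call, and reuses the bi_encoder and cross_encoder models loaded above:

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

def two_stage_search_qdrant(query: str, top_k: int = 10) -> list:
    # Stage 1: vector search in Qdrant using the bi-encoder embedding
    query_embedding = bi_encoder.encode(query).tolist()
    hits = client.search(
        collection_name="semantic-search",
        query_vector=query_embedding,
        limit=100,  # retrieve more candidates than we will return
    )

    # Pull (id, text) pairs out of the stored payloads
    candidates = [(hit.id, hit.payload["text"]) for hit in hits]

    # Stage 2: cross-encoder reranking, exactly as with Pinecone
    pairs = [[query, text] for _, text in candidates]
    scores = cross_encoder.predict(pairs)

    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [(doc_id, text, float(score)) for (doc_id, text), score in reranked[:top_k]]
```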
Many production systems combine semantic search with traditional keyword matching for even better results:
```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

class HybridRetriever:
    def __init__(self, documents: list[str]):
        # BM25 for keyword matching
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

        # Bi-encoder for semantic matching
        self.bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.doc_embeddings = self.bi_encoder.encode(documents)

        # Cross-encoder for reranking
        self.cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

        self.documents = documents

    def search(self, query: str, top_k: int = 10, alpha: float = 0.5):
        # BM25 scores
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_scores = bm25_scores / (bm25_scores.max() + 1e-6)  # Normalize

        # Semantic scores
        query_emb = self.bi_encoder.encode(query)
        semantic_scores = np.dot(self.doc_embeddings, query_emb)
        semantic_scores = (semantic_scores - semantic_scores.min()) / (
            semantic_scores.max() - semantic_scores.min() + 1e-6
        )

        # Combine scores
        hybrid_scores = alpha * semantic_scores + (1 - alpha) * bm25_scores

        # Get top candidates for reranking
        top_indices = np.argsort(hybrid_scores)[::-1][:100]
        candidates = [self.documents[i] for i in top_indices]

        # Cross-encoder reranking
        pairs = [[query, doc] for doc in candidates]
        rerank_scores = self.cross_encoder.predict(pairs)

        reranked = sorted(
            zip(candidates, rerank_scores),
            key=lambda x: x[1],
            reverse=True
        )

        return reranked[:top_k]
```

Bi-encoders and cross-encoders represent two points on the speed-accuracy trade-off curve. Bi-encoders enable searching billions of documents in milliseconds through pre-computed embeddings, while cross-encoders capture nuanced semantic relationships with superior accuracy but can only process hundreds of pairs per query.
For most production systems, the answer isn't choosing one or the other—it's combining them. The two-stage retrieval pattern (bi-encoder retrieval followed by cross-encoder reranking) has become the de facto standard because it captures most of the cross-encoder's accuracy benefits while maintaining real-time latency.
As you implement semantic search, RAG systems, or recommendation engines, start with the two-stage pattern. Tune the number of candidates retrieved and reranked based on your latency budget and accuracy requirements. And remember that fine-tuning on your specific domain data often provides the biggest performance gains of all.