intermediate
15 min read
January 28, 2025

Bi-Encoders vs Cross-Encoders: Choosing the Right Architecture for Semantic Search

Deep dive into bi-encoder and cross-encoder architectures for semantic similarity. Learn the trade-offs, implementation patterns, and when to use each approach in RAG systems and search applications.

Clever Ops Team

When building semantic search, RAG systems, or recommendation engines, one architectural decision will fundamentally shape your system's performance: should you use bi-encoders or cross-encoders? The answer isn't straightforward—each architecture makes different trade-offs between speed and accuracy that matter enormously at scale.

Bi-encoders can search through millions of documents in milliseconds but may miss nuanced relevance. Cross-encoders capture subtle semantic relationships with remarkable accuracy but can't scale beyond a few hundred comparisons per query. Understanding when to use each—and how to combine them—is essential for building production-grade semantic systems.

This guide explains both architectures from first principles, compares their characteristics, and shows you how to implement the two-stage retrieval pattern that powers modern search systems at companies like Google, Microsoft, and OpenAI.

Key Takeaways

  • Bi-encoders pre-compute embeddings for millisecond search across millions of documents
  • Cross-encoders process query-document pairs together for higher accuracy but cannot scale
  • Two-stage retrieval (bi-encoder retrieve, cross-encoder rerank) is the production standard
  • Retrieve 50-200 candidates with bi-encoder, rerank top 10-20 with cross-encoder
  • Fine-tuning on domain data typically improves both architectures by 10-20%
  • Choose pre-trained models based on your language, accuracy needs, and latency budget
  • Hybrid search combining BM25 keyword matching with semantic search often outperforms either alone

The Core Problem: Semantic Similarity at Scale

Traditional keyword search fails when users express the same concept differently. A search for "how to fix a slow laptop" won't match a document titled "Speed up your computer performance" despite being semantically identical. Semantic search solves this by comparing meaning rather than words.

But here's the challenge: to find semantically similar documents, you need to compare your query against every document in your corpus. With millions of documents, this becomes computationally intractable—unless you're clever about how you structure the comparison.

This is where bi-encoders and cross-encoders diverge. They represent two fundamentally different approaches to the same problem:

Bi-Encoder Approach

"Encode everything once, compare embeddings fast"

  • Pre-compute document embeddings
  • Store in vector database
  • Compare query embedding to all docs
  • Millisecond retrieval at any scale

Cross-Encoder Approach

"Consider query and document together for precision"

  • Process query+document pairs
  • Full attention between all tokens
  • More accurate relevance scores
  • Can only score a few hundred pairs

How Bi-Encoders Work

A bi-encoder uses two separate transformer encoders (or the same encoder applied twice) to independently convert queries and documents into fixed-size embedding vectors. These embeddings exist in a shared semantic space where similar meanings cluster together.

Bi-Encoder Architecture

    Query: "laptop running slow"          Document: "Speed up your computer"
              │                                      │
              ▼                                      ▼
    ┌─────────────────┐                   ┌─────────────────┐
    │   Transformer   │                   │   Transformer   │
    │    Encoder      │                   │    Encoder      │
    └────────┬────────┘                   └────────┬────────┘
              │                                      │
              ▼                                      ▼
    [0.23, -0.45, 0.12, ...]             [0.21, -0.42, 0.15, ...]
         Query Embedding                    Document Embedding
              │                                      │
              └──────────────┬───────────────────────┘
                             │
                             ▼
                    Cosine Similarity
                         0.94
            

The Pre-Computation Advantage

The key insight is that document embeddings can be computed once and stored. When a query arrives, you only need to:

  1. Encode the query — One forward pass through the transformer (~10-50ms)
  2. Compare against all documents — Vector similarity operations are extremely fast

With optimised libraries like FAISS or vector databases like Pinecone, you can compare against billions of vectors in under 100 milliseconds. This is why bi-encoders dominate large-scale retrieval.

Bi-Encoder with Sentence-Transformers (Python)

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a bi-encoder model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Pre-compute document embeddings (do this once)
documents = [
    "Speed up your computer performance with these tips",
    "Best practices for Python code optimization",
    "How to troubleshoot network connectivity issues",
    "Machine learning model deployment strategies",
]

# Encode all documents - these embeddings are stored/cached
doc_embeddings = model.encode(documents, convert_to_numpy=True)

# At query time, encode the query and compare
query = "laptop running slow"
query_embedding = model.encode(query, convert_to_numpy=True)

# Compute cosine similarities
similarities = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Get top results
top_indices = np.argsort(similarities)[::-1]
for idx in top_indices[:3]:
    print(f"Score: {similarities[idx]:.3f} | {documents[idx]}")

Embedding Quality Matters

Because bi-encoders compress documents into fixed-size vectors (typically 384-768 dimensions), information is necessarily lost. The quality of this compression depends on:

  • Model architecture: Larger models capture more nuance
  • Training data: Models trained on your domain perform better
  • Embedding dimension: Higher dimensions preserve more information but cost more to store and compare

The Compression Trade-off

A bi-encoder must compress an entire document (potentially thousands of words) into a single vector of a few hundred numbers. This works well for capturing general topic similarity but can miss specific details that matter for relevance. A document about "Python code optimisation" and "Python snake habitats" might have more similar embeddings than you'd expect because "Python" dominates both.
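
A quick way to see this effect on your own model is to embed a few deliberately confusable sentences and inspect the similarity matrix. The sentences below are illustrative and the exact scores depend on the model, so treat this as a diagnostic rather than a benchmark:

Checking Topic-Word Dominance (Python)

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "Tips for Python code optimisation",
    "Python snake habitats in the wild",
    "Making your scripts run faster",
]

# Pairwise cosine similarities; check whether the shared word "Python"
# pulls the two unrelated topics closer together than expected
embeddings = model.encode(sentences, convert_to_numpy=True)
print(util.cos_sim(embeddings, embeddings))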

How Cross-Encoders Work

Cross-encoders take a fundamentally different approach. Instead of encoding query and document separately, they process both together as a single input sequence. This allows full attention between query tokens and document tokens—the model can directly compare every word in the query against every word in the document.

Cross-Encoder Architecture

    Query: "laptop running slow"    Document: "Speed up your computer"
                    │                          │
                    └──────────┬───────────────┘
                               │
                               ▼
            [CLS] laptop running slow [SEP] Speed up your computer [SEP]
                               │
                               ▼
                    ┌─────────────────────┐
                    │    Transformer      │
                    │  (Full Attention    │
                    │   Between All       │
                    │     Tokens)         │
                    └──────────┬──────────┘
                               │
                               ▼
                    ┌─────────────────┐
                    │  Classification │
                    │     Head        │
                    └────────┬────────┘
                             │
                             ▼
                    Relevance Score: 0.89
            

Why Cross-Encoders Are More Accurate

The key advantage is cross-attention. When processing the combined input, the transformer can:

  • Directly compare "laptop" with "computer" and understand they're synonyms in context
  • Recognise that "slow" relates to "speed up" as problem-to-solution
  • Consider word order and grammatical relationships across the query-document boundary

This produces more nuanced relevance judgments. Cross-encoders consistently outperform bi-encoders on relevance benchmarks, often by significant margins (5-15% improvement in metrics like NDCG@10).

Cross-Encoder with Sentence-Transformers (Python)

from sentence_transformers import CrossEncoder

# Load a cross-encoder model trained for relevance ranking
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score query-document pairs
query = "laptop running slow"
documents = [
    "Speed up your computer performance with these tips",
    "Best practices for Python code optimization",
    "How to troubleshoot network connectivity issues",
    "Machine learning model deployment strategies",
]

# Create query-document pairs
pairs = [[query, doc] for doc in documents]

# Score all pairs (returns relevance scores)
scores = model.predict(pairs)

# Sort by score
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"Score: {score:.3f} | {doc}")

The Scalability Problem

Cross-encoders have a fatal flaw for large-scale retrieval: you cannot pre-compute anything. Every query requires processing the query with every document through the full transformer. For a corpus of 1 million documents:

Cross-Encoder Scaling Math

  • Time per pair: ~5-10ms on GPU
  • Documents: 1,000,000
  • Total time: 5,000-10,000 seconds (1.5-3 hours per query)

This is obviously impractical for real-time search.

Cross-encoders can realistically only score hundreds to low thousands of candidates per query. This limitation is fundamental to the architecture—there's no way around it without sacrificing the cross-attention that makes them accurate.
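
In practice, you keep reranking affordable by capping the candidate set and scoring it in batches (CrossEncoder.predict accepts a batch_size argument). A quick timing check like the sketch below, using a stand-in candidate list, tells you how many candidates fit your latency budget on your own hardware:

Timing Cross-Encoder Scoring (Python)

import time
from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "laptop running slow"
# Stand-in candidate set; in production these come from first-stage retrieval
candidates = ["Speed up your computer performance with these tips"] * 200

pairs = [[query, doc] for doc in candidates]
start = time.perf_counter()
scores = model.predict(pairs, batch_size=32)  # batching amortises per-call overhead
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Scored {len(pairs)} pairs in {elapsed_ms:.0f}ms")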

Bi-Encoder vs Cross-Encoder: Complete Comparison

Let's compare both architectures across the dimensions that matter for production systems:

Characteristic | Bi-Encoder | Cross-Encoder
---------------|------------|--------------
Speed at Query Time | Very fast (milliseconds) | Slow (scales linearly with corpus)
Relevance Accuracy | Good | Excellent
Scalability | Billions of documents | Hundreds per query
Pre-computation | Yes: encode docs once | No: must process each query
Index Updates | Add new embeddings easily | No index needed
Memory for Corpus | Embedding storage required | Just document text
GPU Requirements | Query-time only (optional) | Required for reasonable speed
Best For | First-stage retrieval, large corpora | Reranking, high-stakes decisions

Accuracy vs Speed: The Fundamental Trade-off

The performance difference isn't marginal. On standard benchmarks like MS MARCO:

Typical Benchmark Performance (MS MARCO)

Bi-Encoder (all-MiniLM-L6-v2)

  • MRR@10: ~0.33
  • Query latency: 20ms + search
  • Can search 10M+ docs

Cross-Encoder (ms-marco-MiniLM-L-6-v2)

  • MRR@10: ~0.39
  • Query latency: ~5ms per doc
  • Practical limit: ~1000 docs

The cross-encoder achieves roughly 18% better ranking quality, but at a cost that makes it unusable for first-stage retrieval at scale.

The Two-Stage Retrieval Pattern

The solution used by virtually every production semantic search system is two-stage retrieval: use a bi-encoder to quickly retrieve candidates, then use a cross-encoder to precisely rerank the top results.

Two-Stage Retrieval Pipeline

                        User Query
                            │
                            ▼
            ┌───────────────────────────────┐
            │       Stage 1: Retrieval       │
            │         (Bi-Encoder)           │
            │                                │
            │  • Encode query (~20ms)        │
            │  • Search vector index (~50ms) │
            │  • Return top 100 candidates   │
            └───────────────┬───────────────┘
                            │
                     Top 100 docs
                            │
                            ▼
            ┌───────────────────────────────┐
            │       Stage 2: Reranking       │
            │        (Cross-Encoder)         │
            │                                │
            │  • Score 100 pairs (~500ms)    │
            │  • Sort by relevance           │
            │  • Return top 10               │
            └───────────────┬───────────────┘
                            │
                            ▼
                    Final Results
            

This approach captures most of the cross-encoder's accuracy improvement while maintaining millisecond-scale latency. The bi-encoder's job is recall (don't miss relevant documents), while the cross-encoder's job is precision (rank the relevant ones correctly).

Complete Two-Stage Retrieval Implementation (Python)

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from typing import List, Tuple

class TwoStageRetriever:
    def __init__(
        self,
        bi_encoder_model: str = 'all-MiniLM-L6-v2',
        cross_encoder_model: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2',
        top_k_retrieval: int = 100,
        top_k_rerank: int = 10
    ):
        self.bi_encoder = SentenceTransformer(bi_encoder_model)
        self.cross_encoder = CrossEncoder(cross_encoder_model)
        self.top_k_retrieval = top_k_retrieval
        self.top_k_rerank = top_k_rerank

        self.documents: List[str] = []
        self.doc_embeddings: np.ndarray = None

    def index_documents(self, documents: List[str]) -> None:
        """Pre-compute and store document embeddings."""
        self.documents = documents
        self.doc_embeddings = self.bi_encoder.encode(
            documents,
            convert_to_numpy=True,
            show_progress_bar=True
        )
        # Normalize for cosine similarity
        self.doc_embeddings = self.doc_embeddings / np.linalg.norm(
            self.doc_embeddings, axis=1, keepdims=True
        )

    def search(self, query: str) -> List[Tuple[str, float]]:
        """Two-stage search: retrieve then rerank."""
        # Stage 1: Bi-encoder retrieval
        query_embedding = self.bi_encoder.encode(query, convert_to_numpy=True)
        query_embedding = query_embedding / np.linalg.norm(query_embedding)

        # Compute similarities (dot product of normalized vectors = cosine)
        similarities = np.dot(self.doc_embeddings, query_embedding)

        # Get top-k candidates
        top_indices = np.argsort(similarities)[::-1][:self.top_k_retrieval]
        candidates = [self.documents[i] for i in top_indices]

        # Stage 2: Cross-encoder reranking
        pairs = [[query, doc] for doc in candidates]
        rerank_scores = self.cross_encoder.predict(pairs)

        # Sort by rerank scores
        reranked = sorted(
            zip(candidates, rerank_scores),
            key=lambda x: x[1],
            reverse=True
        )

        return reranked[:self.top_k_rerank]


# Usage example
retriever = TwoStageRetriever()

# Index your documents (do once)
documents = [
    "How to improve laptop performance and speed",
    "Python programming best practices guide",
    "Troubleshooting slow computer issues",
    "Machine learning model optimization techniques",
    "Windows performance tuning tips",
    # ... thousands more documents
]
retriever.index_documents(documents)

# Search (fast, accurate)
results = retriever.search("my laptop is running slowly")
for doc, score in results:
    print(f"{score:.3f}: {doc}")

Tuning the Pipeline

The key parameters to tune are:

  • top_k_retrieval: How many candidates to retrieve. Higher values improve recall but increase reranking time. 50-200 is typical.
  • top_k_rerank: How many final results to return. Usually 10-20 for search, 3-5 for RAG.

Latency Budget Example

For a 200ms total latency budget:

  • Query encoding: 20ms
  • Vector search (1M docs): 30ms
  • Cross-encoder reranking (100 docs): 150ms

This leaves headroom for network latency and allows reranking 100 candidates while staying responsive.
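
Latency is only half of the tuning loop; the other half is checking that the bi-encoder stage actually returns the relevant document within its top-k candidates. Given a small labelled evaluation set of queries and their known relevant documents, recall@k for a few candidate values of top_k_retrieval takes only a handful of lines. The sketch below assumes you supply query_embeddings, doc_embeddings, and relevant_idx (the index of each query's relevant document) from your own data:

Measuring Recall@k (Python)

import numpy as np

def recall_at_k(query_embeddings, doc_embeddings, relevant_idx, k):
    """Fraction of queries whose relevant document appears in the top-k retrieved."""
    hits = 0
    for q_emb, rel in zip(query_embeddings, relevant_idx):
        sims = np.dot(doc_embeddings, q_emb)
        top_k = np.argsort(sims)[::-1][:k]
        hits += int(rel in top_k)
    return hits / len(relevant_idx)

# Compare candidate retrieval depths on your labelled evaluation set:
# for k in (10, 50, 100, 200):
#     print(k, recall_at_k(query_embeddings, doc_embeddings, relevant_idx, k))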

Business Use Cases

Understanding when each architecture shines helps you make the right choice for your specific application.

Semantic Search Systems

For customer-facing search (e-commerce, documentation, knowledge bases), the two-stage pattern is essential. Users expect sub-second responses, but also expect relevant results.

E-Commerce Product Search

  • Stage 1: Bi-encoder retrieves 200 products from millions in 50ms
  • Stage 2: Cross-encoder reranks to surface exact matches (e.g., "wireless noise-cancelling headphones" ranks higher than "wireless headphones")
  • Impact: 15-25% improvement in click-through rate

RAG (Retrieval-Augmented Generation)

For RAG systems, the quality of retrieved context directly impacts the quality of generated responses. Cross-encoder reranking is particularly valuable here.

Customer Support AI

  • Stage 1: Bi-encoder finds relevant support articles and past tickets
  • Stage 2: Cross-encoder identifies the most applicable content
  • Impact: Reduces hallucinations, improves answer accuracy by 20-30%

Duplicate Detection

Finding duplicate or near-duplicate content across large document sets. Here, bi-encoders often suffice because you're looking for high similarity rather than subtle relevance.

Content Deduplication

  • Approach: Bi-encoder embeddings with high similarity threshold (>0.9)
  • Scale: Can compare millions of documents in hours
  • Cross-encoder role: Verify borderline cases (0.85-0.95 similarity), as in the sketch below
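
A minimal sketch of this pattern, using a similarity-trained cross-encoder for the borderline band; the thresholds mirror the figures above but should be calibrated on your own data:

Near-Duplicate Detection Sketch (Python)

from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
# A cross-encoder trained for sentence similarity (not query-document relevance)
verifier = CrossEncoder('cross-encoder/stsb-roberta-base')

documents = ["...", "..."]  # your corpus
embeddings = bi_encoder.encode(documents, convert_to_numpy=True)
similarity = util.cos_sim(embeddings, embeddings)

duplicates, borderline = [], []
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        score = float(similarity[i][j])
        if score > 0.95:
            duplicates.append((i, j))      # confident duplicates
        elif score > 0.85:
            borderline.append((i, j))      # send to the cross-encoder

# Cross-encoder verifies only the borderline pairs
if borderline:
    pair_texts = [[documents[i], documents[j]] for i, j in borderline]
    verify_scores = verifier.predict(pair_texts)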

Recommendation Systems

Content-based recommendations using semantic similarity. Bi-encoders excel here because you need to compare user preferences against large item catalogs in real-time.

Content Recommendations

  • Approach: Embed user's reading history, find similar articles
  • Real-time: Update recommendations as user browses
  • Cross-encoder role: Rerank for diversity and freshness

When Bi-Encoder Alone Suffices

Not every use case needs two-stage retrieval. Consider bi-encoder only when:

  • Finding similar items (not query-document matching)
  • High similarity threshold (duplicates, near-matches)
  • Latency constraints under 50ms
  • Lower accuracy is acceptable

Choosing the Right Architecture

Use this decision framework to select the right approach for your use case:

Architecture Selection Decision Tree

Corpus size > 10,000 documents?

Yes → You need a bi-encoder for first-stage retrieval

Real-time latency requirements (< 500ms)?

Yes → Two-stage with limited reranking candidates

High-stakes decisions (legal, medical, financial)?

Yes → Definitely add cross-encoder reranking

Simple similarity matching (duplicates, recommendations)?

Yes → Bi-encoder alone may be sufficient

Model Selection Guide

Choosing the right pre-trained models significantly impacts performance:

Use Case | Recommended Bi-Encoder | Recommended Cross-Encoder
---------|------------------------|---------------------------
General English | all-MiniLM-L6-v2 | cross-encoder/ms-marco-MiniLM-L-6-v2
Higher Accuracy | all-mpnet-base-v2 | cross-encoder/ms-marco-electra-base
Multilingual | paraphrase-multilingual-MiniLM-L12-v2 | cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
Long Documents | BAAI/bge-large-en-v1.5 | BAAI/bge-reranker-large
Maximum Quality | intfloat/e5-large-v2 | cross-encoder/stsb-roberta-large

Fine-Tuning Recommendation

Pre-trained models work well for general use, but fine-tuning on your domain data typically improves performance by 10-20%. This is especially true for specialised domains like legal, medical, or technical content. Both bi-encoders and cross-encoders can be fine-tuned using contrastive learning on query-document pairs.
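
As a rough illustration, here is what bi-encoder fine-tuning looks like with the sentence-transformers training API; the (query, relevant passage) pairs are placeholders for your own domain data, and MultipleNegativesRankingLoss treats the other passages in each batch as negatives:

Fine-Tuning a Bi-Encoder (Python)

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('all-MiniLM-L6-v2')

# Positive (query, relevant passage) pairs from your domain
train_examples = [
    InputExample(texts=["laptop running slow", "How to improve laptop performance and speed"]),
    InputExample(texts=["reset my password", "Step-by-step guide to account password recovery"]),
    # ... thousands more pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

# In-batch negatives: each query learns to rank its own passage
# above the other passages in the batch
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path='fine-tuned-bi-encoder'
)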


Integration with Vector Databases

In production, you'll typically store bi-encoder embeddings in a vector database. Here's how the pattern works with popular options:

Two-Stage Retrieval with Pinecone (Python)

import pinecone
from sentence_transformers import SentenceTransformer, CrossEncoder

# Initialize using the classic pinecone-client API
# (newer SDK versions use `from pinecone import Pinecone` and `Pinecone(api_key=...)`)
pinecone.init(api_key="your-api-key", environment="your-env")
index = pinecone.Index("semantic-search")

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def two_stage_search(query: str, top_k: int = 10) -> list:
    # Stage 1: Vector search with Pinecone
    query_embedding = bi_encoder.encode(query).tolist()

    results = index.query(
        vector=query_embedding,
        top_k=100,  # Retrieve more for reranking
        include_metadata=True
    )

    # Extract documents from results
    candidates = [
        (match.id, match.metadata['text'])
        for match in results.matches
    ]

    # Stage 2: Cross-encoder reranking
    pairs = [[query, doc] for _, doc in candidates]
    scores = cross_encoder.predict(pairs)

    # Combine IDs with reranked scores
    reranked = sorted(
        zip([c[0] for c in candidates], [c[1] for c in candidates], scores),
        key=lambda x: x[2],
        reverse=True
    )

    return reranked[:top_k]

The same pattern works with other vector databases like Qdrant, Weaviate, Milvus, or pgvector. The bi-encoder handles the initial retrieval from the vector index, and the cross-encoder refines the ranking.

Hybrid Search: Adding Keyword Matching

Many production systems combine semantic search with traditional keyword matching for even better results:

Hybrid Search with BM25 + Semantic + Reranking (Python)

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

class HybridRetriever:
    def __init__(self, documents: list[str]):
        # BM25 for keyword matching
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

        # Bi-encoder for semantic matching
        self.bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.doc_embeddings = self.bi_encoder.encode(documents)

        # Cross-encoder for reranking
        self.cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

        self.documents = documents

    def search(self, query: str, top_k: int = 10, alpha: float = 0.5):
        # BM25 scores
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_scores = bm25_scores / (bm25_scores.max() + 1e-6)  # Normalize

        # Semantic scores
        query_emb = self.bi_encoder.encode(query)
        semantic_scores = np.dot(self.doc_embeddings, query_emb)
        semantic_scores = (semantic_scores - semantic_scores.min()) / (
            semantic_scores.max() - semantic_scores.min() + 1e-6
        )

        # Combine scores
        hybrid_scores = alpha * semantic_scores + (1 - alpha) * bm25_scores

        # Get top candidates for reranking
        top_indices = np.argsort(hybrid_scores)[::-1][:100]
        candidates = [self.documents[i] for i in top_indices]

        # Cross-encoder reranking
        pairs = [[query, doc] for doc in candidates]
        rerank_scores = self.cross_encoder.predict(pairs)

        reranked = sorted(
            zip(candidates, rerank_scores),
            key=lambda x: x[1],
            reverse=True
        )

        return reranked[:top_k]
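
Usage mirrors the two-stage retriever above, reusing the same documents list; alpha controls the keyword/semantic balance, and 0.5 is just a neutral starting point worth tuning on a validation set:

Hybrid Retriever Usage (Python)

retriever = HybridRetriever(documents)
results = retriever.search("my laptop is running slowly", top_k=5, alpha=0.5)
for doc, score in results:
    print(f"{score:.3f}: {doc}")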

Conclusion

Bi-encoders and cross-encoders represent two points on the speed-accuracy trade-off curve. Bi-encoders enable searching billions of documents in milliseconds through pre-computed embeddings, while cross-encoders capture nuanced semantic relationships with superior accuracy but can only process hundreds of pairs per query.

For most production systems, the answer isn't choosing one or the other—it's combining them. The two-stage retrieval pattern (bi-encoder retrieval followed by cross-encoder reranking) has become the de facto standard because it captures most of the cross-encoder's accuracy benefits while maintaining real-time latency.

As you implement semantic search, RAG systems, or recommendation engines, start with the two-stage pattern. Tune the number of candidates retrieved and reranked based on your latency budget and accuracy requirements. And remember that fine-tuning on your specific domain data often provides the biggest performance gains of all.

Frequently Asked Questions

When should I use a bi-encoder vs a cross-encoder?

How do I fine-tune a bi-encoder or cross-encoder for my domain?

What embedding dimension should I use for bi-encoders?

How many candidates should I retrieve for cross-encoder reranking?

Can I use cross-encoders for multilingual search?

How do bi-encoders and cross-encoders compare to LLM-based reranking?

What is the impact of document length on encoder performance?

How do I evaluate my retrieval system's performance?

Ready to Implement?

This guide provides the knowledge, but implementation requires expertise. Our team has done this 500+ times and can get you production-ready in weeks.

✓ FT Fast 500 APAC Winner  ✓ 500+ Implementations  ✓ Results in Weeks