There’s a common assumption in AI agent systems:

Combining keyword search with embeddings should outperform either alone.

We tested that assumption under realistic noise.

  • 30 needles (technical, vague, balanced)
  • 1,000 hard-negative distractors (semantically similar documents)
  • 3 random seeds (stability validation)
  • OpenAI text-embedding-3-small
  • Recall@5 as primary metric

This is not a toy corpus.
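
For reference, Recall@5 asks a simple question per query: did the needle appear in the top 5 retrieved documents? A minimal sketch of the metric (the helper names are illustrative, not the actual harness):

```python
# Minimal sketch of the Recall@5 metric: a query scores 1 if its target
# ("needle") document appears in the top-5 ranked results, else 0.
# `needle_id` and `ranked_ids` are hypothetical names used for illustration.

def recall_at_k(needle_id: str, ranked_ids: list[str], k: int = 5) -> float:
    return 1.0 if needle_id in ranked_ids[:k] else 0.0

def overall_recall_at_k(results: list[tuple[str, list[str]]], k: int = 5) -> float:
    # results: one (needle_id, ranked_ids) pair per query
    hits = sum(recall_at_k(needle, ranked, k) for needle, ranked in results)
    return hits / len(results)
```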


The Results

| Method        | Overall Recall@5 |
|---------------|------------------|
| BM25-only     | 31.1%            |
| Semantic-only | 83.3%            |
| Hybrid        | 83.3%            |

1. BM25 Collapses at Scale

With 1,000 semantically similar distractors, lexical matching drops to 31%.

When documents share vocabulary, keyword overlap stops being discriminative.

Measured, not assumed.
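
For context, a lexical baseline along these lines can be put together with the rank_bm25 package. The corpus and query below are placeholders, not the actual test set:

```python
# Illustrative BM25 baseline using the rank_bm25 package.
# Corpus and query are toy placeholders, not the 1,000-distractor benchmark.
from rank_bm25 import BM25Okapi

corpus = ["postgres upgraded to version 15 last sprint",
          "we discussed migrating the auth service",
          "retro notes: ci pipeline flaky on mondays"]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "which database version are we on".lower().split()
scores = bm25.get_scores(query)  # one lexical score per document
top_5 = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:5]
```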

2. Semantic Retrieval Recovers Abstract Queries

On vague questions like:

  • “Any wins worth celebrating?”
  • “What concerns should I know?”

BM25: 25% vs. Semantic: 83%

Embeddings capture abstraction and context beyond surface terms.
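
A minimal sketch of the semantic side, using the same text-embedding-3-small model via the OpenAI SDK (documents and query are placeholders; the real corpus has 1,000+ entries):

```python
# Illustrative semantic retrieval with OpenAI's text-embedding-3-small.
# Document and query strings are placeholders for the sketch.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

docs = ["shipped the new billing flow ahead of schedule",
        "incident review: elevated error rates on checkout"]
doc_vecs = embed(docs)
query_vec = embed(["Any wins worth celebrating?"])[0]

# Cosine similarity; normalize explicitly even though these embeddings
# are already unit-length.
sims = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
top_5 = np.argsort(-sims)[:5]
```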

3. Hybrid Matches Semantic — But Doesn’t Beat It

At this scale, hybrid routing reaches the same 83.3% as semantic-only.

It preserves lexical fallback but does not increase recall when embeddings already dominate.

This is important: hybrid adds safety, not magic lift.
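
The exact routing logic isn't spelled out here, but a common way to fuse lexical and semantic rankings is reciprocal rank fusion (RRF). A sketch under that assumption:

```python
# Reciprocal rank fusion (RRF): one common way to fuse two rankers.
# This is an assumed fusion scheme, not necessarily the routing used
# in the experiment above.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings: each inner list holds doc ids ordered best-first by one retriever
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_7", "doc_2", "doc_9"]      # best-first ids from BM25 (toy)
semantic_ranking = ["doc_2", "doc_4", "doc_7"]  # best-first ids from embeddings (toy)
fused_top_5 = rrf([bm25_ranking, semantic_ranking])[:5]
```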


Failure Cases

Across all 3 seeds, these queries consistently failed:

  • “Which database version?”
  • “Any concerns about code quality?”
  • “How did we decide on architecture?”

These expose the limits of embedding-only ranking.

Next step: cross-encoder reranking.
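
A cross-encoder scores each (query, document) pair jointly instead of comparing two independent vectors, which is what these failure cases seem to need. A sketch using sentence-transformers; the model choice and candidate list are illustrative:

```python
# Illustrative cross-encoder reranking over the top candidates from a
# first-pass retriever. Model name and candidates are assumptions.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Which database version?"
candidates = ["postgres upgraded to version 15 last sprint",
              "we discussed migrating the auth service",
              "retro notes: ci pipeline flaky on mondays"]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in
            sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]
```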


Improvement Over Previous Configuration

Compared to our previous GAM configuration (66.7% Recall@5 at similar scale), the current system improves to 83.3% (+16.6 points).

Real improvement. Not a breakthrough.


Key Takeaways

  • Semantic does the heavy lifting.
  • Hybrid preserves capability without regression.
  • Scaling introduces new ranking challenges.

Full methodology in GAM Whitepaper v3.6.

Reproducible harness. Code shipping soon.