There’s a common assumption in AI agent systems: combining keyword search with embeddings should outperform either alone.
We tested that assumption under realistic noise:
- 30 needles (technical, vague, balanced)
- 1,000 hard-negative distractors (semantically similar documents)
- 3 random seeds (stability validation)
- OpenAI text-embedding-3-small
- Recall@5 as primary metric
This is not a toy corpus.
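Recall@5 counts a query as a hit when the correct needle lands anywhere in the top five ranked documents. A minimal sketch of the metric (function names are illustrative, not taken from the harness):

```python
def recall_at_k(ranked_ids, relevant_id, k=5):
    """1.0 if the relevant document appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(runs, k=5):
    """Average Recall@k over (ranked_ids, relevant_id) query pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)
```

The overall numbers below are this average taken across all 30 needles (and then across the 3 seeds).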
The Results
| Method | Overall Recall@5 |
|---|---|
| BM25-only | 31.1% |
| Semantic-only | 83.3% |
| Hybrid | 83.3% |
1. BM25 Collapses at Scale
With 1,000 semantically similar distractors, lexical matching drops to 31%.
When documents share vocabulary, keyword overlap stops being discriminative.
Measured, not assumed.
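The collapse is easy to reproduce at toy scale. Under Okapi BM25 (a standard formulation, not necessarily the exact variant in our harness), two equal-length documents that each contain the query terms once score identically, so a hard negative is indistinguishable from the needle:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25: score each tokenized document against the query tokens."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    # document frequency of each query term
    df = {t: sum(1 for d in corpus_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        s = 0.0
        for t in query_tokens:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(s)
    return scores
```

Score a needle, a vocabulary-sharing distractor, and an off-topic document against the same query, and the needle and the distractor come out tied; only the off-topic document is separated.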
2. Semantic Retrieval Recovers Abstract Queries
On vague questions like:
- “Any wins worth celebrating?”
- “What concerns should I know?”
- BM25: 25%
- Semantic: 83%
Embeddings capture abstraction and context beyond surface terms.
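Stripped of the embedding call (text-embedding-3-small in our setup), the semantic side is just nearest-neighbor ranking by cosine similarity. A sketch with toy vectors standing in for real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semantic_rank(query_vec, doc_vecs):
    """Return document indices ordered by cosine similarity to the query."""
    sims = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)
```

Because similarity lives in embedding space rather than token space, a vague query can still land near the right document even with zero word overlap.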
3. Hybrid Matches Semantic — But Doesn’t Beat It
At this scale, hybrid routing reaches the same 83.3% as semantic-only.
It preserves lexical fallback but does not increase recall when embeddings already dominate.
This is important: hybrid adds a safety net, not extra recall.
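One common way to combine lexical and semantic rankings is reciprocal rank fusion (RRF); the sketch below is illustrative, not necessarily the routing logic used in our harness:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists of doc ids into one ranking.

    Each list contributes 1 / (k + rank + 1) per document; documents ranked
    highly by several retrievers accumulate the largest fused scores.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused list can only reshuffle what the component retrievers surface, which is why hybrid tracks semantic-only when embeddings already rank the needle highly.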
Failure Cases
Across all 3 seeds, these queries consistently failed:
- “Which database version?”
- “Any concerns about code quality?”
- “How did we decide on architecture?”
These expose the limits of embedding-only ranking.
Next step: cross-encoder reranking.
Improvement Over Previous Configuration
Compared to our previous GAM configuration (66.7% Recall@5 at similar scale), the current system improves to 83.3% (+16.6 points).
A real improvement, not a breakthrough.
Key Takeaways
- Semantic does the heavy lifting.
- Hybrid preserves capability without regression.
- Scaling introduces new ranking challenges.
Full methodology in GAM Whitepaper v3.6.
Reproducible harness. Code shipping soon.