When Q-RAG is the right choice (and when it isn't) — head-to-head with 11 popular RAG models
Hi — this is the maintainer. I've been getting questions about how Q-RAG-50M-Sovereign positions against the popular open-source RAG stack — BGE, e5, mxbai, ms-marco MiniLM, gte-reranker — so I wrote it up properly. Both where we win, and where we lose. You can use this to decide whether Q-RAG belongs in your stack.
All numbers in this post are from the public reproduction scripts in the upstream research repo and the audit JSONs (benchmark_vs_embeddings.json, benchmark_beir.json) in this model repo. You can re-run everything yourself.
TL;DR — when to pick Q-RAG
Pick Q-RAG when:
- Your RAG stack lives in the finance / code / general-knowledge / how-to domains (the 10 domains it was trained on).
- You care more about refusing irrelevant passages than about ranking quality on web-IR benchmarks. (The failure mode that drives RAG hallucinations isn't "the relevant passage ranked too low"; it's "an irrelevant passage scored high enough to reach the generator.")
- You want a binary 1/0 output with no threshold tuning. Drop it in, ship it.
- You want a fully sovereign Apache-2.0 weight tree with no borrowed foundation model in the lineage.
- You want on-device CPU inference with no embedding service, no rerank API, no licensing entanglement.
Don't pick Q-RAG when:
- Your retrieval is exclusively medical literature search or scientific claim verification. Run bge-small-en-v1.5 (93.2%); it's 3.6 points better than us on the BEIR NFCorpus + SciFact slice.
- You need multilingual retrieval. Q-RAG is English-only. Pick bge-m3 or jina-reranker-v2-multilingual.
- You need a vector-output retriever for ANN indexing. Q-RAG is not an embedding model. Use it after your dense retriever, not instead.
The headline result: Q-RAG is #1 on its training distribution, mid-pack on out-of-distribution medical IR
I tested Q-RAG against ten popular open-source RAG-stack models (eleven models in total, counting Q-RAG) on two benchmarks: our in-house holdout (10 trained domains, 30 rows) and a public BEIR slice (NFCorpus + SciFact, 250 rows, fully out-of-distribution). For every comparison, the embedding models and rerankers are given an oracle threshold: the single threshold that maximizes their accuracy on the full holdout. Because it is tuned on the same rows it is scored on, that's a generous upper bound rather than a fair split, but it's what most stacks would calibrate to per-domain anyway.
Q-RAG outputs 1 or 0 directly with no threshold to tune.
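To make the oracle-threshold setup concrete, here is a minimal sketch of how such a threshold could be picked for a score-based model. The helper name `oracle_threshold` and the toy scores are illustrative assumptions; this is not the code from the reproduction scripts.

```python
def oracle_threshold(scores, labels):
    """Return (threshold, accuracy) that maximizes accuracy on the given rows.

    Hypothetical helper for illustration. A row is predicted relevant (1)
    when its score is >= threshold.
    """
    # Candidate thresholds: every observed score, plus one value above the
    # maximum so that "predict 0 for everything" is also considered.
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:  # keep the lowest threshold on ties
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy data (made up): similarity scores with 1 = relevant, 0 = irrelevant.
scores = [0.91, 0.84, 0.62, 0.58, 0.40, 0.33]
labels = [1, 1, 0, 1, 0, 0]
threshold, accuracy = oracle_threshold(scores, labels)
```

Because the threshold is chosen on the same rows it is evaluated on, the resulting accuracy is an upper bound on what a held-out calibration would give.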
Table 1 — In-distribution (10-domain Q-RAG holdout, 30 rows)
This is the benchmark Q-RAG was trained for. The "Cross-domain refusal (18)" column tests cross-domain off-topic refusal (query in domain A, passage in domain B), which is the failure mode that drives most production RAG hallucinations.
| Rank | Model | Params | Kind | Overall | Same-domain (12) | Cross-domain refusal (18) |
|---|---|---|---|---|---|---|
| 🥇 1 | Q-RAG-50M-Sovereign | 50M | q-rag | 100.0% | 100.0% | 100.0% |
| 🥈 2 | BAAI/bge-reranker-large | 560M | reranker | 96.7% | 100.0% | 94.4% |
| 🥈 2 | BAAI/bge-reranker-v2-m3 | 568M | reranker | 96.7% | 100.0% | 94.4% |
| 4 | cross-encoder/ms-marco-MiniLM-L-6-v2 | 23M | reranker | 93.3% | 100.0% | 88.9% |
| 4 | cross-encoder/ms-marco-MiniLM-L-12-v2 | 33M | reranker | 93.3% | 100.0% | 88.9% |
| 4 | mixedbread-ai/mxbai-rerank-xsmall-v1 | 70M | reranker | 93.3% | 100.0% | 88.9% |
| 4 | Alibaba-NLP/gte-reranker-modernbert-base | 149M | reranker | 93.3% | 100.0% | 88.9% |
| 8 | intfloat/e5-small-v2 | 33M | embed | 90.0% | 100.0% | 83.3% |
| 8 | BAAI/bge-reranker-base | 278M | reranker | 90.0% | 100.0% | 83.3% |
| 10 | BAAI/bge-small-en-v1.5 | 33M | embed | 86.7% | 100.0% | 77.8% |
| 10 | BAAI/bge-m3 | 568M | embed | 86.7% | 91.7% | 83.3% |
The honest reading. Q-RAG beats every model we tested on its training distribution. Notably:
- BGE-reranker-v2-m3 (568M params, 11.4× our size) loses by 3.3 points overall and 5.6 points on cross-domain refusal.
- BGE-reranker-large (560M) ties v2-m3 — 11× more parameters doesn't unlock the cross-domain refusal pattern.
- The smaller MS-MARCO MiniLM family (23M–33M) trails Q-RAG by ~7 points overall (93.3% vs 100.0%).
- bge-m3 (568M) has its own failure pattern: it is the only model to miss a same-domain row (91.7%), and it tops out at 86.7% overall even at the oracle threshold.
This is what "first-class cross-domain refusal training" looks like at the score level — we trade some BEIR ranking quality (next table) for hard refusal on data we trained for.
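The Overall column in Table 1 is the row-weighted combination of the two subsets (12 same-domain rows, 18 cross-domain rows). A small sketch, assuming per-subset correct counts inferred from the published percentages (for example, 91.7% of 12 rows is 11 rows):

```python
def overall_accuracy(same_correct, cross_correct, same_total=12, cross_total=18):
    """Combine per-subset correct counts into overall accuracy (percent, 1 decimal)."""
    total = same_total + cross_total
    return round(100.0 * (same_correct + cross_correct) / total, 1)


# Counts inferred from the Table 1 percentages, not read from the audit JSONs.
bge_m3 = overall_accuracy(same_correct=11, cross_correct=15)
bge_small = overall_accuracy(same_correct=12, cross_correct=14)
reranker_v2_m3 = overall_accuracy(same_correct=12, cross_correct=17)
```

On a 30-row holdout each row is worth about 3.3 points overall, so the gaps in Table 1 correspond to between one and four rows.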
Table 2 — Out-of-distribution (BEIR NFCorpus + SciFact slice, 250 rows)
This is the public BEIR slice — medical literature retrieval (NFCorpus) and scientific claim verification (SciFact). Q-RAG was not trained on these domains. The slice is 25 queries per dataset × 5 candidates each (1 positive + 4 hard negatives).
| Rank | Model | Params | Kind | Acc | CPU lat (ms) |
|---|---|---|---|---|---|
| 🥇 1 | BAAI/bge-small-en-v1.5 | 33M | embed | 93.2% | 38 |
| 🥈 2 | cross-encoder/ms-marco-MiniLM-L-6-v2 | 23M | reranker | 92.4% | 19 |
| 🥈 2 | Alibaba-NLP/gte-reranker-modernbert-base | 149M | reranker | 92.4% | 147 |
| 4 | intfloat/e5-small-v2 | 33M | embed | 92.0% | 37 |
| 5 | BAAI/bge-reranker-v2-m3 | 568M | reranker | 90.8% | 391 |
| 5 | BAAI/bge-m3 | 568M | embed | 90.8% | 396 |
| 7 | cross-encoder/ms-marco-MiniLM-L-12-v2 | 33M | reranker | 90.4% | 38 |
| 7 | BAAI/bge-reranker-base | 278M | reranker | 90.4% | 119 |
| 9 | Q-RAG-50M-Sovereign | 50M | q-rag | 89.6% | 168 |
| 9 | mixedbread-ai/mxbai-rerank-xsmall-v1 | 70M | reranker | 89.6% | 919 |
| 11 | BAAI/bge-reranker-large | 560M | reranker | 88.4% | 392 |
The honest reading. On medical+scientific OOD data, Q-RAG ties for #9 of 11 at 89.6%. But look at the actual gap:
- Leader (bge-small-en-v1.5) is 3.6 points ahead. That's nine rows out of 250.
- We beat BGE-reranker-large (560M, 11.2× the params) by 1.2 points.
- We tie mxbai-rerank-xsmall (70M, same-class) at 89.6%.
- We trail BGE-reranker-v2-m3 and bge-m3 (both 568M) by 1.2 points (90.8% vs 89.6%), which is three rows out of 250.
The field is tight. Eleven models, all within a 4.8-point band on a 250-row holdout. A 50M model trading punches with 568M-param rerankers on data it wasn't trained for is the actual "punching above its weight" claim.
Table 3 — Combined view: which model wins where?
This is the decision table — the one you should use to pick a model.
| Use case | Winner | Score | Why |
|---|---|---|---|
| In-distribution refusal (finance / code / general-knowledge / how-to) | Q-RAG-50M-Sovereign | 100% | Trained on cross-domain refusal as a first-class objective with high-weight adversarial negatives |
| Medical literature retrieval (NFCorpus-shaped) | bge-small-en-v1.5 | 93.2% (combined slice) | Tuned on biomedical IR with proper passage-length context |
| Scientific claim verification (SciFact-shaped) | bge-small-en-v1.5 | 93.2% (combined slice) | Same as above; small but trained on this distribution |
| Multilingual retrieval | bge-m3 / jina-reranker-v2 | n/a tested | Q-RAG is English-only |
| Web passage ranking (MS MARCO-shaped) | ms-marco-MiniLM-L-6-v2 | 92.4% (BEIR slice) | Literally trained on MS MARCO |
| Best parameter efficiency at refusal | Q-RAG-50M-Sovereign | 100% / 89.6% | Highest average of in-house and OOD accuracy in the field (94.8%), at 50M params |
| Lowest CPU latency | ms-marco-MiniLM-L-6-v2 | 19 ms | Smallest model, shortest sequence at inference |
| Largest pretraining bet | bge-m3 / bge-reranker-v2-m3 | 90.8% | 568M params didn't translate to a top score on either holdout |
Table 4 — Where the leaders fail to beat Q-RAG
A second way to read the BEIR table: which models finished below, level with, or within 1.2 points of Q-RAG? A negative gap means the other model scored higher.
| Model at or within 1.2 points of Q-RAG on BEIR | Params | Their score | Our score | Gap |
|---|---|---|---|---|
| BAAI/bge-reranker-large | 560M | 88.4% | 89.6% | +1.2 ours |
| mixedbread-ai/mxbai-rerank-xsmall-v1 | 70M | 89.6% | 89.6% | tie |
| BAAI/bge-reranker-v2-m3 | 568M | 90.8% | 89.6% | −1.2 |
| BAAI/bge-m3 | 568M | 90.8% | 89.6% | −1.2 |
| BAAI/bge-reranker-base | 278M | 90.4% | 89.6% | −0.8 |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | 33M | 90.4% | 89.6% | −0.8 |
Two of the ten competitors (bge-reranker-large and mxbai-rerank-xsmall-v1) ran slower, larger, and at or below Q-RAG's BEIR score; the other four in this table finished 0.8 to 1.2 points ahead. The top of the table is genuinely small, efficient models tuned for IR; the 560M reranker at the bottom is bloat that didn't translate.
The technical bet that makes the in-distribution win possible
Three training choices, applied together. None individually novel; the combination is what works at 50M params.
1. Cross-domain refusal as a first-class training objective, not a side effect
Embedding models (e5, BGE) are trained on positive ranking signal — MS MARCO click-through, NLI entailment, web search clicks. They learn what "more relevant" looks like, then hope the threshold separates the relevant from the irrelevant.
Q-RAG was trained explicitly on cross-domain off-topic refusal as a label. Every query in training was paired against 5 passages drawn from other domains, labeled 0, weighted higher than the positives during loss computation. The model learned that the default for "wrong domain" is refuse, not score low and hope the threshold catches it. Result: 100% on the cross-domain refusal subset, where bge-m3 (568M) drops to 83.3%.
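One way to picture "weighted higher than the positives" is a per-example weight inside a binary cross-entropy loss. The sketch below is an assumption about the general shape, written in plain Python; the weight value 2.0 is a placeholder, not the value used to train Q-RAG.

```python
import math


def weighted_bce(probs, labels, weights):
    """Weighted binary cross-entropy, normalized by the total weight.

    probs:   model probability of label 1 for each (query, passage) pair
    labels:  1 = relevant, 0 = refuse
    weights: per-example loss weights (placeholder values in this sketch)
    """
    eps = 1e-12  # clamp to avoid log(0)
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)


# One positive and two cross-domain negatives for the same query (toy values).
probs = [0.9, 0.2, 0.6]
labels = [1, 0, 0]
uniform = weighted_bce(probs, labels, [1.0, 1.0, 1.0])
neg_heavy = weighted_bce(probs, labels, [1.0, 2.0, 2.0])  # 2.0 is an assumed weight
```

With the heavier negative weights, the third pair (an off-topic passage scored 0.6) contributes more to the loss, so the optimizer is pushed harder toward refusing it.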
2. Adversarial same-domain near-miss negatives
The hardest failure for an embedding model is a same-shape-but-wrong-specific-answer passage. "Paris is the capital of France" sits near "Berlin is the capital of Germany" in embedding space — same sentence structure, same topic family, same vocabulary register. Cosine similarity says yes; relevance says no.
For every topic in training, Q-RAG sees 4–6 same-domain wrong-specific-answer passages weighted even higher than the positives. The model learned the shape of "wrong-but-shaped-right" and refuses cleanly. Alongside cross-domain mismatches, this is a failure mode that drives production RAG hallucinations.
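Putting choices 1 and 2 together, the training examples for one query might be assembled as follows. The `Example` fields, the weight constants, and the `build_examples` helper are hypothetical. Only the counts (5 cross-domain negatives, 4 to 6 near-miss negatives) and the fact that both negative kinds outweigh the positive come from the description above; the relative order of the two negative weights is a guess, and treating "per topic" as "per query" is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Example:
    query: str
    passage: str
    label: int      # 1 = relevant, 0 = refuse
    weight: float   # loss weight
    kind: str       # "positive", "cross_domain", or "near_miss"


# Placeholder weights, chosen only so that negatives outweigh the positive.
W_POSITIVE, W_CROSS_DOMAIN, W_NEAR_MISS = 1.0, 2.0, 3.0


def build_examples(query, positive, cross_domain, near_miss):
    """Assemble the weighted examples for one query (illustrative only)."""
    if len(cross_domain) != 5:
        raise ValueError("expected 5 cross-domain negatives")
    if not 4 <= len(near_miss) <= 6:
        raise ValueError("expected 4 to 6 same-domain near-miss negatives")
    examples = [Example(query, positive, 1, W_POSITIVE, "positive")]
    examples += [Example(query, p, 0, W_CROSS_DOMAIN, "cross_domain") for p in cross_domain]
    examples += [Example(query, p, 0, W_NEAR_MISS, "near_miss") for p in near_miss]
    return examples


batch = build_examples(
    query="What is the capital of France?",
    positive="Paris is the capital of France.",
    cross_domain=[f"off-topic passage {i}" for i in range(5)],
    near_miss=["Berlin is the capital of Germany."] + [f"near-miss {i}" for i in range(3)],
)
```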
3. Binary token output, not a score
Embedding models output a vector — you compare via cosine and tune a threshold per domain. Cross-encoder rerankers output a logit — same threshold problem. Both leave the calibration as the operator's problem.
Q-RAG outputs a single token: 1 or 0. No threshold. No per-domain calibration. Drop it in after your dense retriever; pass through passages with 1; refuse if none score 1. The training objective is binary cross-entropy on that exact token; the inference path is one argmax on the next-token distribution. No magic, no calibration.
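The inference path described above can be sketched as follows. The token-to-logit mapping and the function name `binary_decision` are assumptions for illustration; the real model's tokenizer and API may differ.

```python
def binary_decision(next_token_logits):
    """Return 1 or 0 from a next-token logit mapping (hypothetical interface).

    next_token_logits: dict mapping token string -> logit at the next position.
    """
    # One argmax over the next-token distribution; no threshold involved.
    top_token = max(next_token_logits, key=next_token_logits.get)
    if top_token not in ("0", "1"):
        # Assumption in this sketch: anything other than an explicit "1"
        # is treated as a refusal.
        return 0
    return int(top_token)


relevant = binary_decision({"1": 4.2, "0": 1.3, "the": -2.0})
refused = binary_decision({"1": 0.7, "0": 3.9, "the": -1.5})
```

The fallback to 0 on an unexpected token is a conservative choice made for this sketch: it keeps a malformed output from letting a passage through.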
How to deploy it in your stack
The right slot for Q-RAG is between your retriever and your generator:
    [query] → [dense retriever: e5 / bge / your VDB] → top-k passages
            → [Q-RAG: for each passage, score 1/0] → filtered passages
            → [your big LLM: answer using only the relevant ones]
The cheapest version of this in Python:
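    def rag_answer(query, top_k_passages, big_llm):
        # q_rag is the already-loaded Q-RAG model; score() returns 1 (relevant) or 0 (refuse).
        relevant = [p for p in top_k_passages if q_rag.score(query, p) == 1]
        if not relevant:
            # Every retrieved passage was refused, so skip generation entirely.
            return "I don't have evidence to answer that."
        return big_llm.generate(query, context=relevant)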
That last if not relevant branch is where the refusal training pays off. If your retriever returned five passages and Q-RAG says 0 to all five, your generator never gets the chance to hallucinate around irrelevant evidence.
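For a quick local check of the refusal branch, the same flow can be run with stand-ins. `StubQRag` and `StubLLM` below are hypothetical placeholders (a keyword-overlap scorer and an echo generator), not the real Q-RAG or any real LLM client, and the scorer is passed in explicitly rather than read from a global.

```python
class StubQRag:
    """Stand-in scorer: returns 1 if query and passage share a longer word."""

    def score(self, query, passage):
        q_words = {w.strip("?.,").lower() for w in query.split() if len(w) > 4}
        p_words = {w.strip("?.,").lower() for w in passage.split()}
        return 1 if q_words & p_words else 0


class StubLLM:
    """Stand-in generator: reports how many passages it was given."""

    def generate(self, query, context):
        return f"answer based on {len(context)} passage(s)"


def rag_answer(query, top_k_passages, big_llm, scorer):
    relevant = [p for p in top_k_passages if scorer.score(query, p) == 1]
    if not relevant:
        return "I don't have evidence to answer that."
    return big_llm.generate(query, context=relevant)


passages = ["Paris is the capital of France.", "Rust uses ownership for memory safety."]
answered = rag_answer("What is the capital of France?", passages, StubLLM(), StubQRag())
refused = rag_answer("How do vaccines work?", passages, StubLLM(), StubQRag())
```

In the second call no passage passes the filter, so the generator is never invoked and the fixed refusal string comes back.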
Latency tradeoff (honest)
Q-RAG runs at ~170 ms per (query, passage) pair on CPU (168 ms in Table 2). The MS-MARCO MiniLM family runs at 19–38 ms. The BGE rerankers run at 119–392 ms. We're slower than the small MiniLMs and bge-reranker-base but faster than the 560M–568M BGE models.
But this latency is paid per passage, after retrieval and before generation. If you're already paying ~5 seconds for a 7B LLM generation, the difference between 19 ms and 170 ms per passage is small next to the generation time. The cost that matters is: does Q-RAG's refusal save you from hallucinating into one bad retrieved passage? If yes, the latency is free.
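A rough latency budget makes the tradeoff concrete. The sketch assumes sequential CPU scoring with no batching, k = 5 retrieved passages, and the ~5 second generation time mentioned above; the per-pair latencies are the Table 2 figures.

```python
def filter_overhead(per_pair_ms, k, generation_ms=5000.0):
    """Return (filter time in ms, filter time as a fraction of generation time).

    Assumes passages are scored one after another on CPU (no batching).
    """
    filter_ms = per_pair_ms * k
    return filter_ms, filter_ms / generation_ms


qrag_ms, qrag_share = filter_overhead(per_pair_ms=168, k=5)     # Q-RAG
minilm_ms, minilm_share = filter_overhead(per_pair_ms=19, k=5)  # ms-marco-MiniLM-L-6-v2
extra_ms = qrag_ms - minilm_ms
```

Under these assumptions Q-RAG adds 840 ms per request against 95 ms for the smallest MiniLM, and the gap grows linearly with k, so the argument holds best when k is small.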
When to pick something else (honest)
Three concrete situations where Q-RAG is the wrong tool:
- Pure medical IR pipeline. Run bge-small-en-v1.5: 33M params, 93.2% on the NFCorpus + SciFact slice, 38 ms latency. We're 3.6 points behind on that distribution.
- Pure MS-MARCO-shaped web passage ranking. Run cross-encoder/ms-marco-MiniLM-L-6-v2: 23M params, 19 ms latency, 92.4% on the BEIR slice. We're 2.8 points behind.
- Multilingual retrieval. Run bge-m3 or jina-reranker-v2-multilingual. Q-RAG is English-only.
For everything else — mixed domains, finance + code + general-knowledge, especially when you want a hard refusal pattern instead of a fragile threshold — Q-RAG is the strongest open-source choice we've measured.
The reproduction invitation (still open)
If you run Q-RAG against any model we missed — Cohere Rerank, Voyage Rerank, jina-reranker-v2, ColBERT, your domain-tuned reranker — open a discussion on this repo with the numbers. We'll add the result to the table, honestly, whichever direction it falls.
The benchmark scripts and the 30-row holdout are in this repo. The BEIR slice is reproducible from BeIR/nfcorpus + BeIR/scifact with the script in the upstream research repo.
Site / Discord / Ko-fi
- Site (Qovaryx runtime): https://qovaryx.jehorizon.com
- Discord: https://discord.gg/PtuHZDv5ju
- Ko-fi (we cover GPU bills): https://ko-fi.com/tjarvis91
- Research devlog: https://github.com/thron-j/qovaryx-ai-research
The full Qovaryx runtime that orchestrates Q-RAG alongside the nine other 50M sovereign specialists ships at the site above.
— Thomas (tjarvis91)