We put Anthropic's newest model into the step that writes client answers, and checked how well it backs every number with a citation.
Beacon writes research answers from survey data. Every number in an answer has to trace back to a verified figure before a client reads it, and a QA pass guarantees that, so every claim a client sees is accurate regardless of how well the model cites. When the model grounds more of its own numbers up front, QA needs fewer loops to get there.
Opus 4.8 backs 98% of the numbers it writes with a citation. Opus 4.7 backed 89%.
Answers that need a number corrected in QA fall from 92% to 44%.
Answer quality stays level. Blind judges scored the two models almost exactly even (4.89 vs 4.86 overall), and Opus 4.8 runs about 6 seconds slower per answer (80s vs 74s).
Same answer quality. Far fewer unbacked numbers. A much lighter QA load.
More of Opus 4.8's figures run through the citation channel, and far more of its answers come out with nothing for QA to fix.
On a blind one-to-five read, the two models score nearly the same on substance (relevance, clarity, and insight differ by 0.06 at most). The only meaningful lift shows up in citation discipline (+0.15).
Green marks the better result on each row.
| Metric | Description | Opus 4.8 | Opus 4.7 | Change |
|---|---|---|---|---|
| CITATION FIDELITY | | | | |
| Grounding rate (valid refs / numeric tokens) | Valid claim-refs divided by all numeric tokens | 98% | 89% | +8 pp |
| Clean-answer rate (no dangling/free-typed) | Answers with zero ungrounded numbers | 56% | 8% | +48 pp |
| Valid {{c-}} refs per answer | Properly grounded figures per answer | 45 | 39.15 | +5.9 |
| Dangling refs per answer | Refs pointing at a claim that does not exist | 0 | 0.02 | -0.02 |
| Free-typed numbers per answer | Numbers typed outside the citation system | 1.10 | 4.69 | -3.6 |
| Citations[] per answer | Evidence citations attached | 4.98 | 4.27 | +0.7 |
| Unmapped [N] markers per answer | [N] markers with no matching citation | 4.79 | 4.25 | +0.5 |
| QUALITY (deterministic) | | | | |
| Completeness self-score (0-6) | Self-reported coverage of what, why, so-what | 6 | 6 | level |
| Key findings per answer | Distinct key findings | 3.83 | 3.71 | +0.1 |
| Findings with a 'so what' | Findings that carry an implication | 3.83 | 3.71 | +0.1 |
| Answers flagging uncertainty | | 100% | 100% | level |
| Answers suggesting follow-ups | | 100% | 100% | level |
| Answer length (words) | Answer length | 345.21 | 408.94 | -63.7 |
| Charts selected per answer | Charts chosen to support the answer | 1.92 | 1.88 | +0.04 |
| QUALITY SCORE (1-5) | | | | |
| Judge: relevance | | 5.00 | 5.00 | level |
| Judge: clarity | | 4.90 | 4.88 | +0.02 |
| Judge: insight | Depth of analysis, blind judge | 4.94 | 5.00 | -0.06 |
| Judge: citation discipline | Figure discipline, blind judge | 4.71 | 4.56 | +0.15 |
| Judge: overall | Mean of the four judge scores | 4.89 | 4.86 | +0.03 |
| COST / LATENCY | | | | |
| Mean latency (s) | Wall-clock per answer | 80s | 74s | +6s |
| Mean output tokens | Answer size in tokens | 6780.85 | 5364.31 | +1417 |
We replayed the production synthesis prompt across a set of research questions, several times on each model, and scored the raw output before QA. Each answer also received a blind quality score for relevance, clarity, insight, and citation discipline.
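To make the citation-fidelity rows concrete, here is a minimal sketch of how a raw answer could be scored before QA. It is not the production scorer. It assumes grounded figures appear as `{{c-<id>}}` (only the `{{c-}}` prefix comes from the table; the id format is a guess), that evidence markers look like `[N]`, that "numeric tokens" means valid refs plus dangling refs plus free-typed numbers, and that the grounding rate is pooled across answers rather than averaged per answer. The names `score_answer`, `summarize`, and `AnswerScore` are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed syntax: a grounded figure is written as {{c-<id>}}.
REF = re.compile(r"\{\{c-([\w-]+)\}\}")
# Assumed syntax: evidence markers such as [3]; they are not figures.
MARKER = re.compile(r"\[\d+\]")
# A bare numeric literal such as 12, 1,200, 3.5 or 48%.
NUMBER = re.compile(r"(?<![\w.])\d+(?:,\d{3})*(?:\.\d+)?%?")


@dataclass
class AnswerScore:
    valid_refs: int      # refs that resolve to a known claim
    dangling_refs: int   # refs that point at a claim that does not exist
    free_typed: int      # numbers typed outside the citation system

    @property
    def numeric_tokens(self) -> int:
        # Assumption: every figure in the answer falls in one of three buckets.
        return self.valid_refs + self.dangling_refs + self.free_typed

    @property
    def is_clean(self) -> bool:
        # Clean answer: no dangling refs and no free-typed numbers.
        return self.dangling_refs == 0 and self.free_typed == 0


def score_answer(text: str, claim_ids: set[str]) -> AnswerScore:
    """Count valid refs, dangling refs, and free-typed numbers in one answer."""
    ref_ids = REF.findall(text)
    valid = sum(1 for ref_id in ref_ids if ref_id in claim_ids)
    dangling = len(ref_ids) - valid
    # Remove refs and [N] markers so their digits are not counted as figures.
    stripped = MARKER.sub(" ", REF.sub(" ", text))
    free_typed = len(NUMBER.findall(stripped))
    return AnswerScore(valid, dangling, free_typed)


def summarize(scores: list[AnswerScore]) -> dict[str, float]:
    """Pooled grounding rate and clean-answer rate over a batch of answers."""
    if not scores:
        raise ValueError("need at least one scored answer")
    total_tokens = sum(s.numeric_tokens for s in scores)
    total_valid = sum(s.valid_refs for s in scores)
    return {
        "grounding_rate": total_valid / total_tokens if total_tokens else 1.0,
        "clean_answer_rate": sum(s.is_clean for s in scores) / len(scores),
    }


if __name__ == "__main__":
    claims = {"a1", "a2"}
    answers = [
        "Satisfaction rose to {{c-a1}} from {{c-a2}} [1].",
        "Roughly 12% higher than {{c-a1}}, see {{c-zz}} [2].",
    ]
    print(summarize([score_answer(a, claims) for a in answers]))
```

Under the same assumption, the per-answer means in the table give 45 / (45 + 0 + 1.10) ≈ 0.976 for Opus 4.8 and 39.15 / (39.15 + 0.02 + 4.69) ≈ 0.893 for Opus 4.7, which lines up with the reported 98% and 89%.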
The table reads in three parts. Citation fidelity is the main focus. Quality gets two reads, one mechanical and one judged. Cost and latency show the operational cost of the lift.