Opus 4.8 vs 4.7 · AnswerRocket benchmark

Where this runs

Beacon writes research answers from survey data. Every number in an answer has to trace back to a verified figure before a client reads it, and a QA pass guarantees that. Every claim a client sees is accurate either way. When the model grounds more of its own numbers up front, QA has fewer loops to run to get there.

What we found

1
Opus 4.8 backs 98% of the numbers it writes with a citation. Opus 4.7 backed 89%.
2
Answers that need a number corrected in QA fall from 92% to 44%.
3
Answer quality stays level. Blind judges landed almost exactly even between the two models, and Opus 4.8 runs a few seconds slower.

Same answer quality. Far fewer unbacked numbers. A much lighter QA load.

98%

of its numbers cited

up from 89% on Opus 4.7

56%

of answers fully clean

up from 8% on Opus 4.7

Equal

on blind quality

judges split the two almost evenly

Citation discipline

Opus 4.8 grounds far more of its numbers

More of its figures run through the citation channel, and far more answers come out with nothing for QA to fix.

Grounding rate share of numbers that carry a citation

4.8

98%

4.7

89%

Clean-answer rate answers with nothing for QA to fix

4.8

56%

4.7

Answer quality

Quality holds steady

On a blind read of one to five, the two models score the same on substance. The only lift shows up in citation discipline.

Opus 4.8Opus 4.7

Every metric

The full side by side

Green marks the better result on each row.

Metric	Opus 4.8	Opus 4.7	Change
CITATION FIDELITY
Grounding rate (valid refs / numeric tokens) valid claim-refs divided by all numeric tokens	98%	89%	+8 pp
Clean-answer rate (no dangling/free-typed) answers with zero ungrounded numbers	56%	8%	+48 pp
Valid {{c-}} refs per answer properly grounded figures per answer	45	39.15	+5.9
Dangling refs per answer refs pointing at a claim that does not exist	0	0.02	-0.0
Free-typed numbers per answer numbers typed outside the citation system	1.10	4.69	-3.6
Citations[] per answer evidence citations attached	4.98	4.27	+0.7
Unmapped [N] markers per answer [N] markers with no matching citation	4.79	4.25	+0.5
QUALITY (deterministic)
Completeness self-score (0-6) self-reported coverage of what, why, so-what	6	6	level
Key findings per answer distinct key findings	3.83	3.71	+0.1
Findings with a 'so what' findings that carry an implication	3.83	3.71	+0.1
Answers flagging uncertainty	100%	100%	level
Answers suggesting follow-ups	100%	100%	level
Answer length (words) answer length	345.21	408.94	-63.7
Charts selected per answer charts chosen to support the answer	1.92	1.88	+0.0
QUALITY SCORE (1-5)
Judge: relevance	5.00	5.00	level
Judge: clarity	4.90	4.88	+0.02
Judge: insight depth of analysis, blind judge	4.94	5.00	-0.06
Judge: citation discipline figure discipline, blind judge	4.71	4.56	+0.15
Judge: overall mean of the four judge scores	4.89	4.86	+0.03
COST / LATENCY
Mean latency (s) wall-clock per answer	80s	74s	+6s
Mean output tokens answer size in tokens	6780.85	5364.31	+1417

How we measured

We replayed the production synthesis prompt across a set of research questions, several times on each model, and scored the raw output before QA. Each answer also received a blind quality score for relevance, clarity, insight, and citation discipline.

Citation handling

The main focus.

Grounding rate: Valid claim-refs divided by all numeric tokens in the answer.
Clean-answer rate: Share of answers with zero ungrounded or free-typed numbers.
Free-typed numbers: Figures typed straight into prose instead of through the citation system.
Dangling refs: Citation refs that point at a claim which does not exist.

Answer quality

Two reads, one mechanical and one judged.

Deterministic checks: Completeness of what, why and so-what, findings count, uncertainty and follow-up flags, length, and charts.
Quality score, 1 to 5: A blind score across relevance, clarity, insight, and citation discipline.
Blind pairwise win-rate: A blind judge picks the stronger of two anonymized answers to the same question.

Cost and latency

The operational cost of the lift.

Latency: Wall-clock time per answer.
Output tokens: Size of each answer.