Model evaluation

Opus 4.8 vs 4.7

We put Anthropic's newest model into the step that writes client answers, and checked how well it backs every number with a citation.

Opus 4.8
challenger
Opus 4.7
incumbent
Where this runs

Beacon writes research answers from survey data. Every number in an answer has to trace back to a verified figure before a client reads it, and a QA pass guarantees that. Clients therefore see accurate claims whichever model writes the draft. The difference is upstream: when the model grounds more of its own numbers, QA runs fewer correction loops to get there.

What we found
  1. Opus 4.8 backs 98% of the numbers it writes with a citation. Opus 4.7 backed 89%.

  2. Answers that need a number corrected in QA fall from 92% to 44%.

  3. Answer quality stays level. Blind judges landed almost exactly even between the two models, and Opus 4.8 runs about 6 seconds slower per answer (80s vs 74s).

Same answer quality. Far fewer unbacked numbers. A much lighter QA load.

98% of its numbers cited, up from 89% on Opus 4.7
56% of answers fully clean, up from 8% on Opus 4.7
Equal on blind quality: judges split the two almost evenly
Citation discipline

Opus 4.8 grounds far more of its numbers

More of its figures run through the citation channel, and far more answers come out with nothing for QA to fix.

Grounding rate (share of numbers that carry a citation): Opus 4.8 98%, Opus 4.7 89%
Clean-answer rate (answers with nothing for QA to fix): Opus 4.8 56%, Opus 4.7 8%
Answer quality

Quality holds steady

On a blind read scored from one to five, the two models land within 0.06 points of each other on relevance, clarity, and insight. The only meaningful lift shows up in citation discipline (+0.15).

Blind judge scores (scale 1 to 5), Opus 4.8 vs Opus 4.7:
Relevance: 5.00 vs 5.00
Clarity: 4.90 vs 4.88
Insight: 4.94 vs 5.00
Citation: 4.71 vs 4.56
Every metric

The full side by side

Green marks the better result on each row.

| Metric | Description | Opus 4.8 | Opus 4.7 | Change |
|---|---|---|---|---|
| CITATION FIDELITY | | | | |
| Grounding rate (valid refs / numeric tokens) | valid claim-refs divided by all numeric tokens | 98% | 89% | +8 pp |
| Clean-answer rate (no dangling/free-typed) | answers with zero ungrounded numbers | 56% | 8% | +48 pp |
| Valid {{c-}} refs per answer | properly grounded figures per answer | 45 | 39.15 | +5.9 |
| Dangling refs per answer | refs pointing at a claim that does not exist | 0 | 0.02 | -0.0 |
| Free-typed numbers per answer | numbers typed outside the citation system | 1.10 | 4.69 | -3.6 |
| Citations[] per answer | evidence citations attached | 4.98 | 4.27 | +0.7 |
| Unmapped [N] markers per answer | [N] markers with no matching citation | 4.79 | 4.25 | +0.5 |
| QUALITY (deterministic) | | | | |
| Completeness self-score (0-6) | self-reported coverage of what, why, so-what | 6 | 6 | level |
| Key findings per answer | distinct key findings | 3.83 | 3.71 | +0.1 |
| Findings with a 'so what' | findings that carry an implication | 3.83 | 3.71 | +0.1 |
| Answers flagging uncertainty | | 100% | 100% | level |
| Answers suggesting follow-ups | | 100% | 100% | level |
| Answer length (words) | answer length | 345.21 | 408.94 | -63.7 |
| Charts selected per answer | charts chosen to support the answer | 1.92 | 1.88 | +0.0 |
| QUALITY SCORE (1-5) | | | | |
| Judge: relevance | | 5.00 | 5.00 | level |
| Judge: clarity | | 4.90 | 4.88 | +0.02 |
| Judge: insight | depth of analysis, blind judge | 4.94 | 5.00 | -0.06 |
| Judge: citation discipline | figure discipline, blind judge | 4.71 | 4.56 | +0.15 |
| Judge: overall | mean of the four judge scores | 4.89 | 4.86 | +0.03 |
| COST / LATENCY | | | | |
| Mean latency (s) | wall-clock per answer | 80s | 74s | +6s |
| Mean output tokens | answer size in tokens | 6780.85 | 5364.31 | +1417 |
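
The headline grounding rates can be roughly reproduced from the per-answer means in the table. The sketch below rests on an assumption the report does not state: that every numeric token is exactly one of a valid ref, a dangling ref, or a free-typed number. It also recomputes the overall judge score as the mean of the four judge scores. The function and variable names are ours, for illustration only.

```python
# Per-answer means copied from the table: (valid refs, dangling refs, free-typed numbers).
PER_ANSWER_MEANS = {
    "Opus 4.8": (45.0, 0.0, 1.10),
    "Opus 4.7": (39.15, 0.02, 4.69),
}

# Judge scores copied from the table: relevance, clarity, insight, citation discipline.
JUDGE_SCORES = {
    "Opus 4.8": (5.00, 4.90, 4.94, 4.71),
    "Opus 4.7": (5.00, 4.88, 5.00, 4.56),
}


def grounding_rate(valid: float, dangling: float, free_typed: float) -> float:
    """Valid refs over all numeric tokens.

    ASSUMPTION: numeric tokens = valid + dangling + free-typed.
    """
    return valid / (valid + dangling + free_typed)


def overall_judge_score(scores: tuple) -> float:
    """Unweighted mean of the four judge scores."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for model, means in PER_ANSWER_MEANS.items():
        rate = grounding_rate(*means)
        overall = overall_judge_score(JUDGE_SCORES[model])
        print(f"{model}: grounding {rate:.1%}, judge overall {overall:.2f}")
```

Under that assumption the rates come out near 97.6% and 89.3%, which round to the reported 98% and 89%. The unrounded gap is about 8.3 points, which would explain why the table lists the change as +8 pp even though the rounded headline figures differ by 9.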
How we measured

We replayed the production synthesis prompt across a set of research questions, several times on each model, and scored the raw output before QA. Each answer also received a blind quality score for relevance, clarity, insight, and citation discipline.
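
A minimal sketch of the replay loop described above. It is not the production harness: `generate` is a hypothetical callable standing in for whatever runs the synthesis prompt against a model, and `repeats=3` is a placeholder, since the report says only "several times".

```python
from typing import Callable


def replay(
    questions: list[str],
    models: list[str],
    generate: Callable[[str, str], str],  # hypothetical: (model, question) -> raw answer
    repeats: int = 3,                     # placeholder; the real repeat count is not stated
) -> list[dict]:
    """Run every question on every model `repeats` times and keep the raw output.

    The raw answers are collected before any QA pass, so they can be scored as written.
    """
    runs = []
    for model in models:
        for question in questions:
            for attempt in range(repeats):
                runs.append(
                    {
                        "model": model,
                        "question": question,
                        "attempt": attempt,
                        "raw_answer": generate(model, question),
                    }
                )
    return runs
```

Keeping each run as a flat record makes it straightforward to group by model later when computing the per-answer means.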

Citation handling

The main focus.

Grounding rate
Valid claim-refs divided by all numeric tokens in the answer.
Clean-answer rate
Share of answers with zero ungrounded or free-typed numbers.
Free-typed numbers
Figures typed straight into prose instead of through the citation system.
Dangling refs
Citation refs that point at a claim which does not exist.
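
The four definitions above can be sketched as follows. This assumes the answer has already been parsed into numeric tokens, each carrying the id of the claim it cites or `None` if it was typed directly. The `NumericToken` type, the function names, and the choice to pool tokens across answers for the grounding rate are all our assumptions, not the actual scoring code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NumericToken:
    text: str                 # the number as written in the answer, e.g. "42%"
    claim_ref: Optional[str]  # id of the cited claim, or None if free-typed


def score_answer(tokens: list[NumericToken], known_claims: set[str]) -> dict:
    """Classify every numeric token in one answer."""
    valid = sum(1 for t in tokens if t.claim_ref is not None and t.claim_ref in known_claims)
    dangling = sum(1 for t in tokens if t.claim_ref is not None and t.claim_ref not in known_claims)
    free_typed = sum(1 for t in tokens if t.claim_ref is None)
    return {
        "valid": valid,
        "dangling": dangling,
        "free_typed": free_typed,
        "total": len(tokens),
        # Clean means nothing for QA to fix: no dangling refs and no free-typed numbers.
        "clean": dangling == 0 and free_typed == 0,
    }


def summarize(scored: list[dict]) -> dict:
    """Aggregate per-answer scores. ASSUMPTION: grounding rate pools tokens across answers."""
    total = sum(s["total"] for s in scored)
    return {
        "grounding_rate": sum(s["valid"] for s in scored) / total if total else 0.0,
        "clean_answer_rate": sum(s["clean"] for s in scored) / len(scored) if scored else 0.0,
    }
```

Note that under this sketch an answer with no numeric tokens at all counts as clean; whether the real scorer treats that case the same way is not stated.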

Answer quality

Two reads, one mechanical and one judged.

Deterministic checks
Completeness of what, why and so-what, findings count, uncertainty and follow-up flags, length, and charts.
Quality score, 1 to 5
A blind score across relevance, clarity, insight, and citation discipline.
Blind pairwise win-rate
A blind judge picks the stronger of two anonymized answers to the same question.
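
A blind pairwise comparison could look roughly like this. The sketch shuffles the order of each pair so the judge cannot tell which model wrote which answer, then tallies wins. The `judge` callable is hypothetical and is assumed to return `0` or `1` for the stronger answer; ties are not modeled here.

```python
import random
from typing import Callable


def blind_pairwise_win_rate(
    pairs: list[tuple[str, str, str]],      # (question, challenger answer, incumbent answer)
    judge: Callable[[str, str, str], int],  # hypothetical: returns 0 or 1 for the stronger answer
    rng: random.Random,
) -> dict:
    """Share of head-to-head comparisons won by each side."""
    wins = {"challenger": 0, "incumbent": 0}
    for question, challenger_answer, incumbent_answer in pairs:
        entries = [("challenger", challenger_answer), ("incumbent", incumbent_answer)]
        rng.shuffle(entries)  # hide which model is shown first
        picked = judge(question, entries[0][1], entries[1][1])
        wins[entries[picked][0]] += 1
    total = sum(wins.values())
    if total == 0:
        return {"challenger": 0.0, "incumbent": 0.0}
    return {side: count / total for side, count in wins.items()}
```

Passing in a seeded `random.Random` keeps the presentation order reproducible across reruns while still varying it from pair to pair.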

Cost and latency

The operational cost of the lift.

Latency
Wall-clock time per answer.
Output tokens
Size of each answer.