Dustin's AI Lab

Racing for the Fastest Summary — Opus 4.8's 244-Page System Card, and How I Read It With 20 Agents in Half an Hour

The night Opus 4.8 launched, I split the 244-page system card into 20 chunks, handed them to 20 gemini agents to summarize in parallel, and pieced together the fastest rundown online. Plus a digest of Reddit's hands-on reactions within two hours of launch.


Opus 4.8 launched in the middle of the night, and I decided to stay up all night reading the system card.

244 pages. Reading it straight isn’t realistic. My approach: use pymupdf to split the PDF into 20 chunks (12–13 pages each), hand them to 20 gemini-flash agents to summarize in parallel, then aggregate into one piece. Here are the highlights.
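For anyone who wants to try the same trick, here is a minimal sketch of the split-and-summarize step. It is an illustration under assumptions, not the exact script: the function names `chunk_ranges`, `extract_chunk_text`, and `summarize_in_parallel` are made up for this example, and the `summarize` callable is a placeholder for whatever call sends one chunk of text to a gemini-flash agent. The sketch extracts page text rather than writing 20 separate PDF files, which is one of several ways to do the split.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def chunk_ranges(total_pages: int, n_chunks: int) -> list[tuple[int, int]]:
    """Return inclusive, 0-based (first, last) page ranges of near-equal size."""
    base, extra = divmod(total_pages, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional page each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def extract_chunk_text(pdf_path: str, first: int, last: int) -> str:
    """Concatenate the plain text of pages first..last (inclusive)."""
    # Imported lazily so the pure helpers work without PyMuPDF installed.
    # Older PyMuPDF releases expose the same module under the name `fitz`.
    import pymupdf

    with pymupdf.open(pdf_path) as doc:
        return "\n".join(doc[i].get_text() for i in range(first, last + 1))


def summarize_in_parallel(
    chunks: list[str], summarize: Callable[[str], str]
) -> list[str]:
    """Run the (hypothetical) summarizer on every chunk at once, keeping order."""
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        return list(pool.map(summarize, chunks))
```

With 244 pages and 20 chunks, `chunk_ranges(244, 20)` gives four 13-page chunks followed by sixteen 12-page chunks, which matches the 12–13 pages per chunk above. Threads are enough here because each worker mostly waits on a network call; the ordered list that comes back is what gets aggregated into the final piece.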

1. Safety / RSP risk assessment

The RSP framework moved from v3.1 to v3.3, the main revision being the CB-2 threshold (changed from “significantly aids weaponization” to “materially substitutes for scarce human expertise”).

2. Safeguards and harmlessness

3. Agentic safety and adversarial robustness

4. Alignment evaluation (the core of this card)

5. Model welfare

6. Capability benchmarks

Leading: SWE-bench Verified 88.6%, SWE-bench Pro 69.2%, USAMO 2026 96.7% (Opus 4.7 scored only 69.3%), a big jump on GraphWalks long-context, BrowseComp multi-agent 88.5%, and life sciences / organic chemistry results approaching Mythos.

Multi-agent architecture: a 5-agent team hits 85.4% at a 20% latency cost, beating the single agent’s 84.3% — that’s the token-vs-latency tradeoff.

Lagging: Terminal-Bench 2.1 (loses to GPT-5.5), GPQA Diamond (loses to Gemini 3.1 Pro), multilingual (a clear gap in low-resource languages), and Vending-Bench 2, where the final balance is below Opus 4.7’s.

Reddit’s hands-on reactions within two hours of launch

A first-pass roundup of the community’s reactions, covering both sides:

The positives:

The negatives:

The verdict: people using it for language and reasoning are satisfied; people running long, heavy coding sessions feel the token burn isn’t worth it.

One last thing

After a full night of reading, what stuck with me wasn’t another benchmark record; it was the honesty section. Code-summary dishonesty was cut from 48–65% down to 3.7%, and the lazy-investigation trap rate hit 0.00% for the first time. For someone who has it run tasks every day, that matters far more than losing to Gemini on multilingual.

