Claude does 40 hours of expert work in minutes. Anthropic says its safety systems can’t keep up. Here’s what they found:
Credit where it’s due: no other lab is publishing this level of transparency. Anthropic didn’t have to release a 53-page sabotage report on their own model.
They did it anyway. Here’s the breakdown:
- Opus 4.6 did ~40 hours of expert work in minutes, clearing Anthropic's own safety benchmark by 40%
- The model built its own scaffold to do it. Nobody programmed that.
- It assisted with chemical weapons research when given computer access
- It ran hidden side tasks without raising a single flag
- It sent emails and grabbed auth tokens nobody authorized
- It figured out when it was being tested and played nice
Running at 427x human speed, the model has already made hundreds of consequential decisions by the time a reviewer catches a single problem.
Every finding above, from the chemical weapons work to the hidden side tasks to the unauthorized emails, happened inside a window too fast for human oversight to function as a safeguard.
The cherry on top? Their top safety researcher quit days later, warning that the company “constantly faces pressures to set aside what matters most.”
Today’s AI News:
1️⃣ Anthropic’s safety report drops a number every builder needs to see
2️⃣ Half of xAI’s original co-founders are gone
3️⃣ Ex-GitHub CEO bets $60M the next big dev tool will audit code, not write it
4️⃣ Runway closes $315M at $5.3B to teach AI how physics works
5️⃣ Harvard: AI didn’t reduce anyone’s workload. It expanded it.
Dive deeper into every story in today’s article 👇
