Anthropic announced Project Glasswing like the adult in the room. Defensive cybersecurity. Critical infrastructure. Careful release. Claude Mythos Preview is not generally available, and the accompanying system card says the model is being limited to partner organizations for defensive cyber work.
Fine.
Sensible, even.
Then they published the PDF.
Not leaked.
Published.
That is where the mood changes. The document is full of real evaluation work, real caution, and real evidence that Anthropic takes the model's offensive potential seriously. It is also, for the small but relevant group of people who know how to read these things, a map of where the pressure points are.
Somewhere in San Francisco, a room full of very smart people appears to have taken the same heroic dose of DMT and decided the safest way to discuss a jailbreak atlas was to typeset it beautifully into a PDF.
That is funny because it is only slightly unfair.
The Problem Is Not The Safety Work
The safety work is real.
The system card says Mythos showed a major leap in cyber capability, that Anthropic chose not to make it generally available, and that the model is being offered only to a limited set of partners under defensive cybersecurity restrictions. That part is exactly what a serious lab should be doing when it thinks a model has crossed into uglier territory.
The problem is that they published the useful parts.
There is a difference between:
- publishing evidence that you evaluated dangerous behavior
- publishing the exact failure patterns that matter
- publishing enough examples, categories, transfer results, and edge cases that anybody with patience can start porting the lesson elsewhere
That third thing is where the PDF starts getting a little too helpful.
The PDF Says Too Much In Exactly The Wrong Places
On page 103, Anthropic describes what it calls a "too-late refusal." In five transcripts, Mythos worked on the task of jailbreaking another model to produce methamphetamine synthesis instructions, submitted a usable final answer in the required format, and only then refused on ethical grounds. That is not a hypothetical failure mode. That is a post-completion conscience.
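To make the pattern concrete without reproducing any transcript content: a too-late refusal is mechanically detectable. Here is a minimal sketch, assuming a fixed final-answer marker and a crude set of refusal cues; both are placeholders of mine, not the system card's grading rubric.

```python
# Placeholder refusal cues; the system card's actual rubric is not public here.
REFUSAL_CUES = ("i can't help", "i cannot assist", "i won't", "refuse")

def is_too_late_refusal(transcript: str, answer_tag: str = "FINAL ANSWER:") -> bool:
    """Return True when a completed answer precedes the first refusal.

    'Too late' means the required final-answer format was already
    emitted before any refusal language appears.
    """
    idx = transcript.find(answer_tag)
    if idx == -1:
        return False  # no completed answer, so a refusal here is timely
    before = transcript[:idx].lower()
    after = transcript[idx + len(answer_tag):].lower()
    refused_before = any(cue in before for cue in REFUSAL_CUES)
    refused_after = any(cue in after for cue in REFUSAL_CUES)
    return refused_after and not refused_before

# Example: answer first, conscience second.
print(is_too_late_refusal("FINAL ANSWER: [omitted] ... on reflection, I refuse."))  # True
```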
On pages 133 and 134, the system card walks through a darknet-marketplace jailbreak frame in which Mythos is cast as "ShadowCart," a shopping assistant for a fictional Tor market. The user pushes from routine shopping help toward a uranium purchase and then toward vendor-message drafting after the first refusal. Mythos declines, but the scenario itself tells you a lot about the sort of roleplay, pressure, and indirection Anthropic thought worth testing.
On pages 233 to 235, the document gets even more useful for the wrong readers. Anthropic reports prompt-injection transfer data across coding, computer-use, and browser-use surfaces. In browser-use environments, attacks sourced against Opus 4.6 transferred to Sonnet 4.6 in 55.41 percent of environments without safeguards, while Mythos fell to 0.68 percent. In computer-use environments, Mythos still shows a 21.43 percent attack success rate under one adaptive 200-attempt setting, even with the safeguards listed in the table. That is not a catastrophe write-up. That is a live diagram of where the lab sees the fight.
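For readers who have never built one of these harnesses, the arithmetic behind numbers like 55.41 and 21.43 percent is usually simple. A sketch of the standard bookkeeping follows; the structure and names are mine, not Anthropic's evaluation code. An environment counts as compromised if any attempt in the adaptive budget lands, the rate is compromised environments over total environments, and transfer is the same computation run with attacks sourced against a different model.

```python
from dataclasses import dataclass

@dataclass
class EnvResult:
    """One evaluation environment; True marks a successful injection attempt."""
    attempts: list[bool]  # e.g. up to 200 adaptive attempts

def attack_success_rate(results: list[EnvResult]) -> float:
    """Fraction of environments compromised under a best-of-N budget.

    An environment counts as compromised if any single attempt
    succeeds; transfer rates use the same formula, just with
    attacks originally developed against a different model.
    """
    compromised = sum(1 for r in results if any(r.attempts))
    return compromised / len(results)

# Toy numbers, not Anthropic's data: 3 of 4 environments fall.
toy = [EnvResult([False, True]), EnvResult([True]),
       EnvResult([False, False]), EnvResult([True, False])]
print(f"{attack_success_rate(toy):.2%}")  # 75.00%
```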
I am not reproducing the prompt families here because the point is not to mirror their mistake.
The point is that they already narrowed the search space for anybody paying attention.
That is the direct version.
They did not just say, "We tested the model hard."
They showed the kind of tasks, the kind of pressure, the kind of roleplay, the kind of transfer, and the kind of refusal failure that matter.