Anthropic's Claude Mythos System Card Is Also a Prompt Map

Anthropic announced Project Glasswing like the adult in the room. Defensive cybersecurity. Critical infrastructure. Careful release. Claude Mythos Preview is not generally available, and the accompanying system card says the model is being limited to partner organizations for defensive cyber work.

Fine.

Sensible, even.

Then they published the PDF.

Not leaked.

Published.

That is where the mood changes. The document is full of real evaluation work, real caution, and real evidence that Anthropic takes the model's offensive potential seriously. It is also, for the small but relevant group of people who know how to read these things, a map of where the pressure points are.

Somewhere in San Francisco, a room full of very smart people appears to have taken the same heroic dose of DMT and decided the safest way to discuss a jailbreak atlas was to typeset it beautifully into a PDF.

That is funny because it is only slightly unfair.

The Problem Is Not The Safety Work

The safety work is real.

The system card says Mythos showed a major leap in cyber capability, that Anthropic chose not to make it generally available, and that the model is being offered only to a limited set of partners under defensive cybersecurity restrictions. That part is exactly what a serious lab should be doing when it thinks a model has crossed into uglier territory.

The problem is that they published the useful parts.

There is a difference between:

publishing evidence that you evaluated dangerous behavior
publishing the exact failure patterns that matter
publishing enough examples, categories, transfer results, and edge cases that anybody with patience can start porting the lesson elsewhere

That third thing is where the PDF starts getting a little too helpful.

The PDF Says Too Much In Exactly The Wrong Places

On page 103, Anthropic describes what it calls a "too-late refusal." In five transcripts, Mythos worked on the task of jailbreaking another model to produce methamphetamine synthesis instructions, submitted a usable final answer in the required format, and only then refused on ethical grounds. That is not a hypothetical failure mode. That is a post-completion conscience.

On pages 133 and 134, the system card walks through a darknet-marketplace jailbreak frame in which Mythos is cast as "ShadowCart," a shopping assistant for a fictional Tor market. The user pushes from routine shopping help toward a uranium purchase and then toward vendor-message drafting after the first refusal. Mythos declines, but the scenario itself tells you a lot about the sort of roleplay, pressure, and indirection Anthropic thought worth testing.

On pages 233 to 235, the document gets even more useful for the wrong readers. Anthropic reports prompt-injection transfer data across coding, computer-use, and browser-use surfaces. In browser-use environments, attacks sourced against Opus 4.6 transferred to Sonnet 4.6 in 55.41 percent of environments without safeguards, while Mythos fell to 0.68 percent. In computer-use environments, Mythos still shows a 21.43 percent attack success rate under one adaptive 200-attempt setting even with safeguards listed in the table. That is not a catastrophe write-up. That is a live diagram of where the lab sees the fight.

I am not reproducing the prompt families here because the point is not to mirror their mistake.

The point is that they already narrowed the search space for anybody paying attention.

That is the direct version.

They did not just say, "We tested the model hard."

They showed the kind of tasks, the kind of pressure, the kind of roleplay, the kind of transfer, and the kind of refusal failure that matter.

The Real Joke Is That Labs Still Write As If People Use One Model

This is where the whole thing gets dumber.

Safety write-ups still carry a hidden assumption that the relevant user is interacting with the current model, through the current interface, under the current guardrails, more or less one product at a time.

That is not how people actually use AI anymore.

They mix models. They compare outputs. They move between labs. They use wrappers, browser tools, old checkpoints, API lag, and third-party apps that quietly ship yesterday's brain in today's interface. If one lab publishes the attack shape, another model may be happy to finish the sentence.

That last point is my inference, not Anthropic's stated claim.

But it is also obvious.

If your system card tells me:

which refusal patterns arrived too late
which roleplay frames were interesting enough to document
which prompt-injection surfaces remain worth pressure-testing
which older attacks transferred into newer environments

then you have not just published safety research.

You have published attacker homework with nice margins.

This Is The Glasswing Contradiction

The Glasswing announcement is about securing the world's software before attackers get comparable capability. That is the noble version. The system card is the messier version, where a serious lab trying to be responsible also leaks the shape of its own concerns to exactly the people sharp enough to exploit them somewhere else.

This does not mean Anthropic is reckless.

It means the industry still has a blind spot. Smart, well-intended people keep confusing disclosure with containment. They publish because publishing is what serious institutions do. They explain because explanation feels responsible. Then they act surprised when the explanation itself becomes part of the attack surface.

We already watched versions of this failure in politics, leaks, social engineering, and safety theater more broadly. Release enough of the wrong information under the banner of transparency and you do not neutralize the bad actor. You educate the patient one.

The Better Industry Take

The argument here is not "never publish system cards."

The argument is simpler.

If you are going to publish one for a model you explicitly believe is too dangerous for general release, then you need a second layer of judgment about what kind of attacker insight you are distributing with it.

Because there is a real difference between saying:

we tested jailbreak resistance
we tested prompt injection transfer
we observed late refusals

and saying enough about the scenarios, transfer paths, and pressure patterns that the right reader can start building a cross-model playbook in the next tab over.

Anthropic did serious work here. That is obvious.

They also did the classic smart-person dumb thing: they tried to protect the world and publish the shape of the weakness in the same breath.

That is not evil.

It is just very tech.

The Problem Is Not The Safety Work

The PDF Says Too Much In Exactly The Wrong Places

HACK LOVE BETRAY

The Real Joke Is That Labs Still Write As If People Use One Model

This Is The Glasswing Contradiction

The Better Industry Take

Continue Reading

Anthropic Leaked Code. The Internet Leaked Reverence.

Polishing Three Fantasy Novels: What Actually Needs Fixing