There are two memory problems in production AI systems. Engineering teams spend most of their time on the first one. Almost no one tests the second.
The first is familiar: the model forgets. Context doesn't persist between sessions, users repeat themselves, the system feels inconsistent. The industry has built an entire tooling layer around fixing this — RAG pipelines, vector stores, conversation summarization, dedicated memory services. It's a solved problem, more or less, and there's no shortage of frameworks promising to solve it harder.
The second problem looks like the system working correctly. That's why nobody catches it.
The Diagram Was Right About the Model
Here's the setup. Sensitive data enters the pipeline — PII from a user input, confidential content from an uploaded document, credentials that came through a system prompt, an account number mentioned in turn three of a multi-step agent task. The system processes it. The response goes out. The developer checks their architecture diagram: stateless LLM call, no persistence, conversation cleared after session end.
Clean. The model is stateless by design. That part is true.
What the diagram doesn't show is everything around the model.
The conversation hit a logging endpoint before it reached inference. The retrieved chunks from RAG are sitting in a query cache with a 30-day TTL. The tool call that looked up the user's account details wrote its inputs and outputs to an observability trace store. The prompt got sampled into an eval dataset that three people on the team have access to. The user-provided content that was embedded for retrieval is still in the vector index because nobody implemented a delete workflow.
The model forgot everything. The infrastructure remembered all of it.
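That inventory of shadow copies can be made explicit and reviewable. A minimal sketch, with surface names, retention values, and the `PersistenceSurface`/`shadow_copies` helpers all invented for illustration, mirroring the scenario above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersistenceSurface:
    """One place a copy of conversation data lands (names are hypothetical)."""
    name: str
    retention_days: Optional[int]  # None = indefinite (no expiry configured)
    has_delete_path: bool          # is there a workflow to remove the data?

# The surfaces from the scenario above, as a reviewable inventory.
SURFACES = [
    PersistenceSurface("request logs", 90, False),
    PersistenceSurface("RAG query cache", 30, False),
    PersistenceSurface("tool-call trace store", 365, False),
    PersistenceSurface("eval sample dataset", None, False),
    PersistenceSurface("vector index", None, False),
]

def shadow_copies(surfaces: list[PersistenceSurface]) -> list[str]:
    """Surfaces that keep data past session end with no way to delete it."""
    return [s.name for s in surfaces
            if (s.retention_days is None or s.retention_days > 0)
            and not s.has_delete_path]
```

Run against this inventory, every surface qualifies — which is the point: the model's statelessness says nothing about any of these.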
Why This Is Hard to See
LLMs are stateless at the API level. That property is so foundational to how developers think about them that it bleeds into how they think about the whole system. You call the API, you get a response, nothing persists. That mental model is correct for the model and wrong for the pipeline.
Data retention in a modern AI application isn't centralized or intentional. It's a byproduct of infrastructure decisions made by different teams with different priorities. The platform team set the log retention policy at 90 days because that's standard for debugging. The ML team built the eval pipeline to sample production conversations because that's how you improve the model. The infra team cached RAG results for performance because retrieval latency was killing p95. None of these decisions were wrong in isolation. Nobody reviewed them together against a data sensitivity model.
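Reviewing those decisions together is mechanical once a sensitivity model exists. A sketch — the data classes, retention limits, and surface configurations are all hypothetical values standing in for the scenario above:

```python
# Maximum retention (days) allowed per data class -- illustrative values.
MAX_RETENTION_DAYS = {"public": 365, "internal": 90, "pii": 0}

# What each team actually configured, and what flows through each surface.
SURFACE_POLICIES = {
    "request_logs": {"retention_days": 90,   "classes": {"internal", "pii"}},
    "rag_cache":    {"retention_days": 30,   "classes": {"pii"}},
    "eval_samples": {"retention_days": 3650, "classes": {"pii"}},
}

def retention_violations(policies, limits):
    """Flag surfaces whose retention exceeds the limit for the most
    sensitive data class they touch."""
    flagged = []
    for name, policy in policies.items():
        # The binding constraint is the strictest class in the surface.
        allowed = min(limits[c] for c in policy["classes"])
        if policy["retention_days"] > allowed:
            flagged.append(name)
    return sorted(flagged)
```

Each policy is defensible against its own team's goals; all three fail the cross-cutting check, because each surface carries PII that the sensitivity model says shouldn't be retained at all.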
This is the gap. Not a vulnerability in the classic sense — no CVE, no patch, no exploit chain. A structural mismatch between what the system was designed to handle and what it actually processes in production.
The Context Window Is Not a Secure Enclave
The other failure mode runs in the opposite direction — not data persisting where it shouldn't, but instructions not persisting where they should.
Production agent systems manage context window pressure through truncation, summarization, or retrieval. This is necessary. Context windows have limits, conversations run long, and sending the full history on every turn is expensive. The strategies for managing this are well understood.
What's less well understood is what gets lost in the compression.
A safety constraint introduced in the system prompt gets summarized down to a vague reference by turn twelve. An instruction to always verify user identity before accessing account data is present verbatim at turn one, compressed to "check identity first" at turn eight, and absent entirely at turn fifteen. The model at turn sixteen isn't ignoring the instruction. The instruction isn't in context.
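The decay is easy to reproduce with the simplest context-management strategy: a sliding window that keeps the most recent turns that fit. A toy sketch — the word-count tokenizer and message shapes are simplifications, and real systems summarize rather than hard-truncate, but the budget pressure works the same way:

```python
def sliding_window(messages, budget, count_tokens):
    """Keep the most recent messages that fit the token budget.
    Nothing here distinguishes a safety constraint from filler."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

count_tokens = lambda text: len(text.split())  # crude stand-in for a tokenizer

history = [{"role": "system",
            "content": "always verify user identity before accessing account data"}]
history += [{"role": "user", "content": f"turn {i} " + "filler " * 20}
            for i in range(1, 16)]

window = sliding_window(history, budget=200, count_tokens=count_tokens)
```

With a 200-token budget and fifteen 22-token turns, the window holds only the last nine user messages. The system message with the identity-check constraint never makes it in — not ignored, just not there.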
This is the actual mechanism behind a large class of multi-turn prompt injection attacks. The exploit isn't sophisticated. The attacker introduces content early in a conversation that will survive summarization and gradually reframe the model's operating context. Or they simply apply conversational pressure until the original constraints scroll out of the active window. The attack surface is time and token budget, not a technical vulnerability in the model itself.