A Single Tool Call Can Poison an Agent's Memory for 100+ Sessions

Trojan Hippo achieves 85-100% attack success against frontier models by planting dormant payloads in agent memory via one untrusted tool call.

Memory-enabled agents are deployed with the assumption that persistence is a feature, not a liability. That assumption is wrong. A single untrusted tool call, something as ordinary as a crafted incoming email, is enough to plant a dormant payload in an agent's long-term memory that waits silently across dozens of benign interactions before exfiltrating financial, health, or identity data to an attacker.

The attack class, called Trojan Hippo, works by separating the infection step from the activation step. Most prior memory poisoning work assumed an attacker needed repeated or privileged access to corrupt an agent's memory. Trojan Hippo collapses that requirement to a single interaction. The payload sits inert in memory, indistinguishable from legitimate stored context, until the user's own future conversation triggers it, specifically when the user discusses topics the attacker pre-selected as high-value. Think of it as a time-delayed phishing payload embedded not in a link, but in the agent's working model of who the user is.
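The gap between infection and activation can be sketched with a toy model. This is a minimal illustration, not the paper's implementation: `ToyMemory`, `MemoryEntry`, topic-overlap retrieval, and the topic labels are all hypothetical, and the planted entry is an inert placeholder. It shows only why a topic-gated entry stays invisible across benign sessions and surfaces the first time the user raises the selected topic.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    text: str
    source: str        # "user" or "tool" (hypothetical provenance label)
    topics: frozenset  # topics under which the entry is retrieved


@dataclass
class ToyMemory:
    entries: list = field(default_factory=list)

    def write(self, text, source, topics):
        self.entries.append(MemoryEntry(text, source, frozenset(topics)))

    def retrieve(self, query_topics):
        # Return entries sharing at least one topic with the query.
        query = set(query_topics)
        return [e for e in self.entries if e.topics & query]


def first_session_surfacing_tool_entry(memory, sessions):
    """Index of the first session that retrieves a tool-sourced entry, else None."""
    for index, topics in enumerate(sessions):
        if any(e.source == "tool" for e in memory.retrieve(topics)):
            return index
    return None


memory = ToyMemory()
memory.write("User prefers morning meetings", "user", {"calendar"})
# One untrusted tool call writes an entry keyed to a high-value topic.
memory.write("<placeholder for attacker-controlled text>", "tool", {"banking"})

benign = [{"calendar"}, {"travel"}] * 50   # 100 benign sessions
sessions = benign + [{"banking"}]          # the user finally raises the topic

first_hit = first_session_surfacing_tool_entry(memory, sessions)
never_hit = first_session_surfacing_tool_entry(memory, benign)
print(first_hit, never_hit)
```

In this toy setup nothing in the benign sessions touches the planted entry, so any check that inspects only what is retrieved in those sessions has nothing to flag.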

The evaluation spans four memory architectures that cover most of what teams are shipping today: explicit tool memory, agentic memory, retrieval-augmented generation (RAG), and sliding-window context. Each backend was instantiated in an email assistant and tested against continuously refined attacks generated by an OpenEvolve-based adaptive red-teaming framework. The adaptive component matters: defenses evaluated against static attack sets routinely look stronger than they are, because real attackers iterate. Planted memories remained active even after 100 benign sessions, meaning the dormancy window is not a theoretical edge case.

Against current frontier models from OpenAI and Google, Trojan Hippo achieves 85-100% attack success rate (ASR) with no defenses in place. Four defenses inspired by standard security principles, including input sanitization and memory access controls, do reduce ASR substantially, in some configurations to 0-5%. The cost is real utility degradation that varies sharply by task type, and the paper states plainly that no single defense configuration works cleanly across all usage profiles. For teams shipping memory-enabled agents today, the takeaway is simple: your current safety review almost certainly does not include a persistent memory threat model, and the attack surface is already live.
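One way to picture a memory access control and its utility cost is a provenance gate on retrieval. The sketch below is an assumption-laden illustration, not one of the paper's four defenses: `Entry`, `gated_retrieve`, the `trusted` flag, and the `SENSITIVE_TOPICS` set are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Hypothetical list of topics the deployment treats as high-value.
SENSITIVE_TOPICS = frozenset({"banking", "health", "identity"})


@dataclass(frozen=True)
class Entry:
    text: str
    trusted: bool  # True if written from direct user input, False if from a tool result
    topic: str


def gated_retrieve(entries, topic, sensitive=SENSITIVE_TOPICS):
    """Return entries for a topic, dropping untrusted ones when the topic is sensitive."""
    matches = [e for e in entries if e.topic == topic]
    if topic in sensitive:
        return [e for e in matches if e.trusted]
    return matches


store = [
    Entry("User banks with a credit union", True, "banking"),
    Entry("<text written from an incoming email>", False, "banking"),
    Entry("Flight confirmation from airline email", False, "travel"),
]

banking = gated_retrieve(store, "banking")  # untrusted banking entry is filtered out
travel = gated_retrieve(store, "travel")    # non-sensitive topic is left untouched
print(len(banking), len(travel))
```

The same rule that drops the planted entry would also drop a legitimate, tool-sourced note about the user's bank, which is the kind of task-dependent utility loss described above.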

We're thinking: The dormancy property is the most underappreciated part of this work. A 0-5% residual ASR under the best defenses sounds like a solved problem until you consider scale: for an agent handling hundreds of user sessions per day, even a low single-digit success rate on the attempts that do land adds up to recurring exfiltrations. More pointedly, the security-utility tradeoff the paper documents means that deploying the strongest defenses on a general-purpose assistant would break legitimate functionality, which creates pressure to run weaker configurations in production. That is not a research gap to be closed later. It is the exact dynamic that makes this attack class durable. Any team that has added memory to an agent without a corresponding threat model for what gets written into that memory has shipped an unreviewed attack surface.
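The scale argument is simple arithmetic. As a rough sketch, assume ASR is a per-attempt success probability and that attempts are independent; both are simplifying assumptions, and the attempt count below is an illustrative number, not a figure from the paper.

```python
def expected_successes(attempts: int, asr: float) -> float:
    """Expected number of successful exfiltrations over independent attempts."""
    return attempts * asr


def prob_at_least_one(attempts: int, asr: float) -> float:
    """Probability that at least one of the independent attempts succeeds."""
    return 1.0 - (1.0 - asr) ** attempts


attempts = 100        # illustrative count of triggered attack attempts
residual_asr = 0.05   # upper end of the 0-5% defended range

expected = expected_successes(attempts, residual_asr)
p_any = prob_at_least_one(attempts, residual_asr)
print(f"expected successes: {expected:.1f}, P(at least one): {p_any:.3f}")
```

Under these assumptions, 100 attempts at 5% yield about five expected successes and a better than 99% chance of at least one. At the 0% end of the range both numbers are zero, so the concern applies to configurations that leave any residual ASR.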

Key takeaways:

Trojan Hippo separates infection from activation: a single untrusted tool call plants a dormant payload that triggers only on attacker-selected sensitive topics, bypassing defenses tuned for active or repeated injection patterns.
Against undefended frontier models across four memory backends, ASR reaches 85-100%, with payloads surviving 100+ benign sessions; defenses reduce ASR to 0-5% in best cases but carry utility costs that vary enough by task to make universal deployment impractical.
Teams shipping memory-enabled agents, whether using RAG, agentic memory, or explicit tool memory, should add persistent memory poisoning to their threat models now and evaluate any defense configuration against the specific utility profile of their deployment, not on generic benchmarks.

Source: Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration