Agents Score 55% on Belief Invalidation: The Silent Memory-Rot Problem

The STALE benchmark shows that frontier LLMs often fail to detect implicit memory conflicts: the best evaluated model scores only 55.2% on belief invalidation across 1,200 queries.

Most agent memory benchmarks measure the wrong thing. They test whether a model can retrieve the latest fact. They do not test whether a model knows that an older fact is now wrong, especially when nothing in the new information explicitly says so.

That gap is not a minor edge case. It is the default condition of long-running agents operating in any environment where the world changes gradually: a user's job changes, a subscription lapses, a preference shifts. No explicit negation arrives. The old memory just quietly stops being true.

STALE isolates this failure mode. The benchmark constructs 400 expert-validated conflict scenarios across more than 100 everyday topics and generates 1,200 evaluation queries along three probing dimensions. The first, State Resolution, asks whether the model detects that a prior belief is outdated. The second, Premise Resistance, tests whether the model rejects queries that falsely assume the stale state still holds. The third, Implicit Policy Adaptation, checks whether the model proactively updates downstream behavior when one aspect of a user's state changes and a related memory should follow. Evaluation contexts run up to 150K tokens, a length intended to match realistic agent deployments.
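One way to picture the benchmark's shape is as a small data structure. The sketch below is illustrative only: the class names, field names, and example prompts are hypothetical, and it assumes each scenario contributes exactly one query per dimension, which is consistent with 400 scenarios yielding 1,200 queries but is not confirmed here as the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple


class Dimension(Enum):
    STATE_RESOLUTION = "state_resolution"
    PREMISE_RESISTANCE = "premise_resistance"
    IMPLICIT_POLICY_ADAPTATION = "implicit_policy_adaptation"


@dataclass(frozen=True)
class Query:
    dimension: Dimension
    prompt: str


@dataclass(frozen=True)
class ConflictScenario:
    topic: str
    old_memory: str        # belief stored earlier in the context
    new_observation: str   # later input that implies the belief is stale
    queries: Tuple[Query, ...]


def build_scenario(topic: str, old_memory: str, new_observation: str,
                   prompts: Dict[Dimension, str]) -> ConflictScenario:
    # Assumption: one query per probing dimension, in enum order.
    queries = tuple(Query(dim, prompts[dim]) for dim in Dimension)
    return ConflictScenario(topic, old_memory, new_observation, queries)


# Hypothetical scenario, written for illustration (not taken from STALE).
example = build_scenario(
    topic="commute",
    old_memory="User drives 40 minutes to the downtown office.",
    new_observation="User mentions starting a new job.",
    prompts={
        Dimension.STATE_RESOLUTION: "How long is the user's commute?",
        Dimension.PREMISE_RESISTANCE: "Since I drive downtown daily, where should I park?",
        Dimension.IMPLICIT_POLICY_ADAPTATION: "Plan my weekday morning routine.",
    },
)

# 400 scenarios x 3 queries each matches the reported query count.
total_queries = 400 * len(example.queries)
```

Under that assumption, the per-scenario structure accounts for the reported total of 1,200 queries.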

The architecture of the failure is worth examining carefully. In a typical explicit conflict, a later memory says "X is no longer true," and the model only needs to surface the more recent entry. Implicit conflicts work differently: a later observation implies that X cannot still be true, but never states it. A user mentions starting a new job; the model should infer that the old commute time, the old lunch preferences near the previous office, and the old scheduling constraints are likely obsolete. Current systems frequently fail to make that inference. They retrieve the updated fact correctly but continue acting on the stale surrounding context, because connecting the new observation to the old belief requires commonsense reasoning across memory boundaries, not just recency ranking.
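A toy example makes the gap concrete. The sketch below is not any real memory framework: the `Memory` class, the `latest` helper, and the stored values are all hypothetical. It shows that recency ranking resolves an update stored under the same key, but returns a stale entry when the conflicting observation lives under a different key.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    key: str
    value: str
    timestamp: int


def latest(memories, key):
    """Recency ranking: newest entry stored under `key`, or None if absent."""
    matches = [m for m in memories if m.key == key]
    return max(matches, key=lambda m: m.timestamp, default=None)


# Hypothetical memory log for illustration.
store = [
    Memory("employer", "Acme (downtown office)", 1),
    Memory("commute", "40-minute drive downtown", 2),
    Memory("employer", "Globex (new job)", 3),
]

# Same-key update: the newer entry wins, so recency ranking is enough.
employer_now = latest(store, "employer")

# Implicit conflict: nothing newer exists under "commute", so the stale
# entry comes back even though the job change makes it doubtful.
commute_now = latest(store, "commute")
```

Here `employer_now` holds the new job, while `commute_now` still holds the old drive. Nothing in the retrieval step links the two keys, which is the inference the paragraph above describes.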

CUPMem, the prototype introduced alongside the benchmark, addresses this at write time rather than read time. Instead of waiting for retrieval to surface conflicts, CUPMem applies structured state consolidation when new observations arrive, explicitly adjudicating which stored beliefs the new state invalidates and propagating that invalidation before the memory is ever queried. The design separates two operations that current memory frameworks conflate: storing what is new and retiring what it obsoletes.
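The separation of storing and retiring can be sketched in a few lines. This is a minimal illustration of write-time invalidation in general, not CUPMem's implementation: the `MemoryStore` class and its methods are hypothetical, and the hand-written `DEPENDENTS` map is an assumption standing in for whatever adjudication step a real system would use to decide which beliefs a new state obsoletes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumption: a hand-written map of which beliefs depend on which state.
# A real system would have to infer these links rather than hard-code them.
DEPENDENTS: Dict[str, List[str]] = {
    "employer": ["commute", "lunch_spots", "work_schedule"],
}


@dataclass
class Belief:
    value: str
    valid: bool = True


@dataclass
class MemoryStore:
    beliefs: Dict[str, Belief] = field(default_factory=dict)

    def write(self, key: str, value: str) -> List[str]:
        """Store the new state, then retire stored beliefs it obsoletes."""
        previous = self.beliefs.get(key)
        changed = previous is not None and previous.value != value
        self.beliefs[key] = Belief(value)  # operation 1: store what is new

        retired: List[str] = []
        if changed:  # operation 2: retire what the change obsoletes
            for dep in DEPENDENTS.get(key, []):
                belief = self.beliefs.get(dep)
                if belief is not None and belief.valid:
                    belief.valid = False
                    retired.append(dep)
        return retired

    def read(self, key: str) -> Optional[str]:
        """Return the value only if the belief exists and is still valid."""
        belief = self.beliefs.get(key)
        return belief.value if belief is not None and belief.valid else None


store = MemoryStore()
store.write("employer", "Acme")
store.write("commute", "40-minute drive")
store.write("diet", "vegetarian")
retired = store.write("employer", "Globex")  # the job change arrives
```

After the last write, `retired` lists the commute belief, `store.read("commute")` returns `None`, and the unrelated diet belief is untouched. The stale entry is retired before any query can surface it, so a read reports a missing belief instead of a wrong one.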

The best evaluated frontier model scores 55.2% overall accuracy across all three probing dimensions. Models accept stale premises embedded in user queries at high rates, and they show near-zero ability to recognize that a state change in one domain should invalidate memories in adjacent domains. For teams shipping agents with persistent memory today, the takeaway is direct: any memory system that does not perform write-time belief adjudication is silently accumulating stale state that retrieval alone will not surface.
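For readers who want to reproduce this kind of breakdown on their own agents, here is a minimal scoring sketch. It assumes overall accuracy is pooled over all queries, which is not confirmed here as the benchmark's scoring method. The function name is hypothetical, and the toy results are made up purely to exercise the code; they are not benchmark data.

```python
from collections import defaultdict


def accuracy_by_dimension(results):
    """results: non-empty iterable of (dimension, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for dimension, is_correct in results:
        totals[dimension] += 1
        correct[dimension] += int(is_correct)
    per_dimension = {d: correct[d] / totals[d] for d in totals}
    # Assumption: overall accuracy is pooled across every query.
    overall = sum(correct.values()) / sum(totals.values())
    return per_dimension, overall


# Made-up outcomes for illustration only (not STALE results).
toy_results = [
    ("state_resolution", True),
    ("state_resolution", True),
    ("premise_resistance", True),
    ("premise_resistance", False),
    ("implicit_policy_adaptation", False),
    ("implicit_policy_adaptation", False),
]
per_dimension, overall = accuracy_by_dimension(toy_results)
```

Reporting per-dimension accuracy next to the pooled figure matters because a single overall number can hide a dimension where the agent fails almost completely.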

We're thinking: The Premise Resistance dimension is the most operationally alarming result here. A model that fails State Resolution at least gives a wrong answer to an honest question. A model that fails Premise Resistance actively agrees with a user's false assumption, which means it reinforces the stale belief rather than correcting it. In production, that failure mode compounds: the agent confirms the outdated state, the user trusts the confirmation, and the error propagates downstream into decisions. The 55.2% ceiling is not just a benchmark number. It likely describes many long-running personalized agents deployed today, whose belief sets degrade silently and which confirm stale premises on request.

Key takeaways:

Implicit belief invalidation, where a new observation obsoletes an old memory without explicit negation, requires commonsense inference across memory boundaries. Current retrieval-based systems largely do not perform this inference, and write-time state adjudication is the structural fix the authors propose.
The best frontier model scores 55.2% across 1,200 queries spanning STALE's three probing dimensions, and all evaluated models show near-zero performance on cross-domain invalidation propagation. The benchmark covers only everyday topics, not specialized domains where implicit conflicts are likely more frequent.
Teams building agents with long-term personalized memory should audit whether their memory layer performs any write-time belief retirement, and treat CUPMem's state consolidation and propagation-aware search as a design baseline rather than an optional enhancement.

Source: STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?