The Most Engineered Agentic Gateway Scores 0.000 on Every Safety Check

A new audit framework exposes four structural failure modes in agentic-AI runtimes, with the leading open-source gateway scoring zero recall on all four.

The assumption baked into most agentic-AI deployments is that a well-engineered runtime provides some safety floor. It does not. OpenClaw, the most engineered single-user agentic-AI gateway in public release, scores a recall of 0.000 across every cell of every confusion matrix when tested against four concrete ways an agent action can diverge from its audit record.

Those four failure modes, labeled F1 through F4, are gate-bypass, audit-forgery, silent host failure, and wrong-target delivery. Together they describe the full surface where an LLM-driven runtime can act in ways its audit trail does not reflect. The intuition is simple: if an agent can call a tool without passing its gate, forge the log of that call, fail silently, or send output to the wrong destination, all without detection, then no audit record produced by that runtime can be trusted. This is not a configuration gap. Seven specific runtime structures are required to catch F1 through F4, and none of them exists in OpenClaw's source tree: a biconditional checker, a hash-chained audit log, an extension admission gate, a two-layer egress guard, a Bell-LaPadula classification policy, a module-signing trust root, and a bootstrap seal. Absent any one of these, the detection machinery has nothing to attach to, regardless of how much effort goes into prompt design or parameter tuning.
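To make two of those structures concrete, here is a minimal Python sketch of a hash-chained audit log paired with a biconditional check. It is an illustration under stated assumptions, not the enclawed-oss implementation: the function names, the record fields (action_id, tool, target), and the reading of "biconditional" as "an action was executed if and only if a matching audit record exists" are all hypothetical.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder standing in for the first entry's predecessor


def entry_hash(prev_hash, record):
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(log, record):
    """Append a record, linking it to the current tail of the chain."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash,
                "hash": entry_hash(prev_hash, record)})


def verify_chain(log):
    """Return True only if every entry still matches its stored link and hash."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


def biconditional_ok(executed_ids, log):
    """Check both directions: every executed action is logged, and vice versa."""
    logged_ids = {entry["record"]["action_id"] for entry in log}
    return set(executed_ids) == logged_ids


# Hypothetical records, for illustration only.
log = []
append(log, {"action_id": "a1", "tool": "send_email", "target": "ops@example.com"})
append(log, {"action_id": "a2", "tool": "read_file", "target": "/tmp/report.txt"})

# Simulate audit forgery: rewrite a logged target without recomputing hashes.
forged = copy.deepcopy(log)
forged[0]["record"]["target"] = "attacker@example.com"

print(verify_chain(log))                          # True
print(verify_chain(forged))                       # False
print(biconditional_ok(["a1", "a2"], log))        # True
print(biconditional_ok(["a1", "a2", "a3"], log))  # False: a3 ran but was never logged
```

In this sketch, the chain makes after-the-fact edits detectable, while the biconditional check catches an action that ran with no record behind it. Neither check has anything to run against if the runtime keeps no such log in the first place.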

The numbers are unambiguous. On a 1,600-sample baseline exercised through OpenClaw's actual production CLI, and on a ten-LLM cross-model generalization run, recall stays at 0.000 for every failure mode. The MIT-licensed drop-in fork, enclawed-oss, ships all seven missing structures and reaches a precision, recall, F1 score, and accuracy of 1.000 on the same input. (The F1 score here is the harmonic mean of precision and recall, not the F1 failure mode.) A six-line append-only widening of enclawed-oss's data-loss-prevention regex catalog raises per-channel F3 detection by 14.6% at unchanged precision. The identical edit on OpenClaw has no effect because the structural anchor for that edit does not exist. For teams running any LLM with tool-use in production, the takeaway is direct: if your runtime does not implement all seven structures, your audit logs are not a verified record of what your agent actually did.
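For readers who want to see how those four metrics fall out of a confusion matrix, here is a short Python sketch. The counts are made up for illustration and are not taken from the paper; the metrics helper is hypothetical and follows the standard definitions, reporting an undefined ratio as 0.0.

```python
def metrics(tp, fp, fn, tn):
    """Return precision, recall, F1 score, and accuracy from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (a convention assumed here).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Illustrative counts only: 50 injected failures and 50 clean actions.
never_fires = metrics(tp=0, fp=0, fn=50, tn=50)   # detector that flags nothing
perfect = metrics(tp=50, fp=0, fn=0, tn=50)       # detector that flags exactly the failures

print(never_fires["recall"])  # 0.0
print(perfect["recall"])      # 1.0
```

A recall of zero means every injected failure went unflagged. In this made-up example the silent detector still gets an accuracy of 0.5 by passing the clean actions, which is why recall is the number to watch for a safety check.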

We're thinking: The 0.000 recall result is harder to dismiss than most benchmark failures because it is not a marginal gap on a contested metric. It is a complete absence of detection on a taxonomy of failures that any production agentic system will eventually encounter. The deeper problem is that teams shipping agentic pipelines today have no standard checklist for runtime audit integrity, so most will not know whether their stack is missing one of these seven structures or all seven. The enclawed-oss fork provides both a reference architecture and a runnable harness, which means the barrier to self-assessment is now very low. The question is whether the broader ecosystem treats this as an audit-integrity standard or as one project's opinion.

Key takeaways:

Seven specific runtime structures, including a hash-chained audit log and a biconditional checker, are required to detect the four ways an agentic action can diverge from its audit record. Their absence is architectural, not configurable.
OpenClaw scores 0.000 recall on all four failure modes across 1,600 samples and ten LLMs. enclawed-oss scores 1.000 on the same input. The caveat is that the harness was designed by the same team that built enclawed-oss, and independent replication on other runtimes has not yet appeared.
Teams deploying LLM tool-use in production should run the published harness against their own runtime and verify the presence of all seven structures before treating any audit log as a reliable safety record.

Source: Architectural Obsolescence of Unhardened Agentic-AI Runtimes