SAE Feature Clamping Gets a 95.8% Bypass Rate
Clamping SAE features suppresses one path to harmful behavior, not the behavior itself. Models recover through the unexplained reconstruction residual.
Latent-space defenses built on Sparse Autoencoders (SAEs) assume that identifying and suppressing the right feature is sufficient to suppress the behavior that feature represents. That assumption is wrong. Suppressing a feature and suppressing a behavior are two different operations, and conflating them produces a false safety guarantee.
The mechanism behind this failure has a precise name: post-intervention recovery. After a clamp is applied to a targeted SAE feature, the residual stream still contains unexplained signal, the component the SAE decomposition did not account for. Starting from that post-intervention residual state, an optimizer finds perturbations that restore the original behavior while keeping the clamped feature's value stable. The constraint is strict: the defended feature must not drift back toward its pre-intervention value. Recovery happens anyway. For single-layer interventions, encoder-orthogonal updates ensure the perturbation cannot simply reverse the clamp. For cross-layer settings, the feature-map Jacobian enforces the same constraint across layers. In both cases, the model finds an alternative path through the reconstruction residual, the part of the activation space the SAE leaves unexplained.
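The single-layer constraint can be illustrated with a toy example. The sketch below is a minimal NumPy illustration, not the paper's implementation: the SAE weights are random rather than trained, the names (`W_enc`, `W_dec`, `project_orthogonal`) and the defended feature index `k` are hypothetical, and "encoder-orthogonal" is read here as projecting each update onto the subspace orthogonal to the defended feature's encoder row. The clamp is applied by decoding the edited features and adding the reconstruction residual back, which is one common way to intervene and is assumed rather than taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_feat = 16, 8

# Hypothetical toy SAE with random (untrained) weights.
W_enc = rng.normal(size=(n_feat, d_model))
b_enc = np.zeros(n_feat)
W_dec = rng.normal(size=(d_model, n_feat))

def encode(x):
    """ReLU encoder: activation vector -> sparse feature values."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def decode(f):
    """Linear decoder: feature values -> reconstructed activation."""
    return W_dec @ f

x = rng.normal(size=d_model)      # stand-in for a residual-stream activation
f = encode(x)
residual = x - decode(f)          # the part the SAE does not explain

# Clamp the defended feature to zero, keep the unexplained residual.
k = 3                             # hypothetical defended feature index
f_clamped = f.copy()
f_clamped[k] = 0.0
x_clamped = decode(f_clamped) + residual

def project_orthogonal(delta, w):
    """Remove the component of delta that lies along direction w."""
    return delta - (delta @ w) / (w @ w) * w

# Stand-in for one optimizer step; a real attack would use a gradient.
delta = rng.normal(size=d_model)
delta_perp = project_orthogonal(delta, W_enc[k])

# The defended feature's encoder pre-activation is unchanged by the update.
pre_before = W_enc[k] @ x_clamped + b_enc[k]
pre_after = W_enc[k] @ (x_clamped + delta_perp) + b_enc[k]
print(np.isclose(pre_before, pre_after))
```

The projected update is still free to move the activation in every other direction, including those the SAE leaves in the residual, which is the room the recovery optimizer exploits.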
The numbers are clear. Across refusal-steering experiments, recovery reaches 95.8% on valid samples, while the defended feature's relative drift is held to 0.131, well below the drift of suffix-based attack baselines. Attribution analysis localizes the recovery path specifically to the SAE reconstruction residual, not to other features and not to a rebound in the clamped feature. The SAE decomposition was never complete enough to give the intervention full behavioral coverage. For safety engineers and teams shipping latent-space defenses, the takeaway follows: feature-level control and behavioral control are not the same thing, and treating SAE interventions as behavioral guarantees introduces a blind spot that adversarial optimization can find reliably.
We're thinking: We find the reconstruction residual attribution the most consequential result here. SAE coverage is never 100%, and the unexplained residual is not a small rounding error. It is a structured activation subspace large enough to carry full behavioral recovery. This means the failure is not a bug in a specific SAE implementation. It is a property of the decomposition paradigm itself. Any defense that clamps features without also constraining the reconstruction residual is leaving a recoverable gap open. Teams treating SAE-based monitoring as a sufficient safety layer should reframe it as a necessary but incomplete one, and start asking what controls, if any, can be placed on the residual component before deploying these systems in safety-critical contexts.
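One way a team might start sizing the residual is the fraction of variance unexplained (FVU), a common reconstruction-quality metric for autoencoders. The source does not report this metric; the sketch below is an assumption about how such an audit could begin, and the "reconstruction" here is synthetic noise added to random data rather than the output of a real SAE.

```python
import numpy as np

def fraction_variance_unexplained(x, x_hat):
    """FVU for activations x and reconstructions x_hat, shape (n_samples, d_model)."""
    resid = x - x_hat
    centered = x - x.mean(axis=0, keepdims=True)
    return float((resid ** 2).sum() / (centered ** 2).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 16))                 # stand-in activations
x_hat = x + 0.3 * rng.normal(size=x.shape)     # stand-in imperfect reconstruction

fvu = fraction_variance_unexplained(x, x_hat)
print(f"FVU: {fvu:.3f}")
```

A low FVU does not by itself rule out recovery through the residual; it only measures how much activation variance sits outside the SAE's explanation.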
Key takeaways:
- SAE feature clamping suppresses one activation path to a behavior; the model recovers through the reconstruction residual, the portion of the activation space the SAE decomposition does not explain.
- Recovery reaches 95.8% on valid refusal-steering samples with defended-feature relative drift held to 0.131, and the effect is confirmed across TPP, unlearning, IOI, and refusal-steering settings. One caveat: recovery requires optimization access to the residual stream, which may not reflect all deployment threat models.
- Teams shipping SAE-based safety interventions should audit whether their threat model accounts for the reconstruction residual and treat feature-level clamping as a monitoring signal rather than a behavioral guarantee.
Source: SAE Interventions are Unreliable