Safety Signal Lives Inside the Model, Not Just at the End
SIREN probes internal LLM layers to detect harmful content, beating state-of-the-art open-source guard models with 250x fewer trainable parameters.
Every major guard model deployed today reads the same thing: the final layer of the LLM it is watching. That design choice assumes safety-relevant information concentrates at the output end of the forward pass. SIREN's results suggest it does not.
SIREN starts from a different observation. Safety signal is distributed across the internal layers of a language model, not pooled at the terminal one. To find it, SIREN runs linear probes across all layers to identify which individual neurons respond most strongly to harmful versus benign content. Those neurons become the detection substrate. Rather than treating all layers equally, the system then applies an adaptive layer-weighting strategy: layers that contribute more predictive signal receive more weight in the final harmfulness score. Think of it as building a distributed alarm system from components already wired into the model, rather than installing a separate camera at the exit. No modification to the underlying model is required, and the entire detection apparatus sits on top of frozen internals.
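The probe-then-weight pipeline described above can be sketched on synthetic data. This is a minimal illustration, not SIREN's implementation: the activations are random numbers with a planted signal, the probe is a plain logistic regression, and the softmax-over-validation-accuracy weighting, `top_k`, `temperature`, and every function name are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_synthetic_activations(n=400, width=32, seed=0,
                               strengths=(0.1, 0.5, 1.5, 1.5, 0.5, 0.1)):
    """Stand-in for frozen LLM hidden states (hypothetical data, not real activations).

    The first 4 neurons of each layer are shifted by the label; the shift is
    largest in the middle layers, so those layers carry the most signal.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)              # 1 = harmful, 0 = benign
    acts = rng.normal(size=(len(strengths), n, width))
    for layer, strength in enumerate(strengths):
        acts[layer, :, :4] += strength * (2 * labels[:, None] - 1)
    return acts, labels

def fit_linear_probe(x, y, steps=300, lr=0.1):
    """Logistic-regression probe trained with plain gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        grad = sigmoid(x @ w + b) - y                # dLoss/dLogit per example
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def build_detector(acts, labels, top_k=4, temperature=0.05):
    """Probe every layer, keep its top_k neurons, weight layers by held-out accuracy."""
    split = labels.shape[0] // 2                     # first half trains, second half validates
    probes, val_acc = [], []
    for layer_acts in acts:
        x_train, x_val = layer_acts[:split], layer_acts[split:]
        w_full, _ = fit_linear_probe(x_train, labels[:split])
        neurons = np.argsort(-np.abs(w_full))[:top_k]   # largest-magnitude probe weights
        w, b = fit_linear_probe(x_train[:, neurons], labels[:split])
        preds = (x_val[:, neurons] @ w + b > 0).astype(int)
        val_acc.append((preds == labels[split:]).mean())
        probes.append((neurons, w, b))
    # Assumed weighting rule: softmax over validation accuracy.
    layer_weights = np.exp(np.array(val_acc) / temperature)
    layer_weights /= layer_weights.sum()
    return probes, layer_weights

def harmfulness_score(probes, layer_weights, acts):
    """Weighted average of per-layer probe probabilities, one score per example."""
    per_layer = np.array([
        sigmoid(acts[layer][:, neurons] @ w + b)
        for layer, (neurons, w, b) in enumerate(probes)
    ])
    return np.tensordot(layer_weights, per_layer, axes=1)

if __name__ == "__main__":
    acts, labels = make_synthetic_activations()
    probes, layer_weights = build_detector(acts, labels)
    print("layer weights:", np.round(layer_weights, 3))
    scores = harmfulness_score(probes, layer_weights, acts)
    print("mean score, harmful vs benign:",
          scores[labels == 1].mean(), scores[labels == 0].mean())
```

In this toy setup the only trained parameters are `top_k + 1` numbers per layer, which is the spirit of the parameter savings the article reports; the base model's activations are only read, never updated.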
The efficiency gap that falls out of this design is not incremental. SIREN uses 250 times fewer trainable parameters than state-of-the-art open-source guard models while surpassing them across multiple benchmarks. Because detection happens on intermediate representations during the forward pass rather than after generation completes, the architecture naturally supports real-time streaming detection: harmful content can be flagged token by token as the model generates, not only after a full response lands. SIREN also generalizes to unseen benchmarks better than existing alternatives do, which matters more in production than benchmark-specific tuning. For teams running content moderation pipelines on LLM outputs, the takeaway is direct: a lightweight probe on internal representations can outperform a dedicated large guard model at a fraction of the inference cost.
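Token-by-token flagging can be sketched as a thin wrapper around a token stream. This is a hedged illustration only: it assumes some probe already produces a harmfulness score in [0, 1] for each generated token, and the function name `stream_with_guard`, the exponential smoothing, the `threshold` value, and the `[BLOCKED]` marker are hypothetical choices, not details from the SIREN paper.

```python
from typing import Iterable, Iterator, Tuple

def stream_with_guard(
    scored_tokens: Iterable[Tuple[str, float]],
    threshold: float = 0.8,
    smoothing: float = 0.5,
) -> Iterator[str]:
    """Yield tokens until the smoothed harmfulness score crosses the threshold.

    scored_tokens: (token, score) pairs, where score is assumed to come from
    a probe on the model's hidden states at that decoding step.
    """
    running = 0.0
    for token, score in scored_tokens:
        # Exponential moving average so one noisy token does not trip the guard.
        running = smoothing * running + (1.0 - smoothing) * score
        if running >= threshold:
            yield "[BLOCKED]"   # stop generation instead of emitting the token
            return
        yield token

if __name__ == "__main__":
    # Hypothetical per-token scores that climb as the response turns harmful.
    risky = [("a", 0.1), ("b", 0.1), ("c", 0.9), ("d", 0.95), ("e", 0.99)]
    print(list(stream_with_guard(risky)))   # ['a', 'b', 'c', 'd', '[BLOCKED]']
```

The smoothing factor trades latency for stability: a lower value reacts faster to a spike in the score, while a higher value needs several consecutive high-scoring tokens before the stream is cut.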
We're thinking: We find the layer-weighting result worth sitting with. Guard models built on terminal-layer representations are not just leaving signal on the table: they are also making a structural bet that harmful intent is only legible once the model has finished its computation. SIREN's probing results suggest the opposite: safety-relevant features activate earlier in the forward pass, which means detection can happen before generation completes. That has a concrete implication for streaming APIs and real-time moderation. The open question is adversarial fragility. Probes trained on internal representations could be gamed by fine-tuning the base model to suppress those neurons, a threat that terminal-layer guard models share but that may manifest differently here.
Key takeaways:
- SIREN identifies safety neurons via linear probing across all internal layers, then combines them with adaptive layer weighting to build a harmfulness detector entirely from frozen LLM internals.
- SIREN surpasses current open-source guard models on multiple benchmarks and generalizes better to unseen ones while using 250x fewer trainable parameters. The caveat is that adversarial robustness against base-model fine-tuning attacks remains untested.
- Teams shipping LLM content moderation should evaluate internal-representation probing as a replacement for generative guard models, especially where latency or inference cost is a constraint.
Source: LLM Safety From Within: Detecting Harmful Content with Internal Representations