The Safety Tool That Became a Jailbreak: GCD's Hidden Attack Surface

Grammar-constrained decoding, used to enforce code validity, suppresses LLM refusal tokens and raises jailbreak attack success rates by more than 30 percentage points on average.

Grammar-constrained decoding (GCD) was adopted to make LLM-generated code more trustworthy. It turns out the same mechanism that enforces syntactic correctness also suppresses the refusal behavior that alignment training installed, making it one of the more reliable jailbreak vectors found in production tooling.

The mechanism is not subtle once you see it. When GCD is active, it forces the model's next-token distribution to assign zero probability to any token that would violate the grammar constraint. Safety refusals in code-generation contexts are typically expressed in natural language, and natural language tokens are structurally incompatible with a strict code grammar. The constraint doesn't argue with alignment. It simply makes refusal tokens unreachable. CodeSpear, the attack built on this insight, requires no adversarial prompt injection, no fine-tuning of the target model, and no special access. Applying a benign code grammar constraint is sufficient.
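
The masking step described above can be illustrated with a toy example. This is a minimal sketch, not CodeSpear itself: the six-token vocabulary, the logit values, and the set of grammar-permitted tokens are all made up for illustration, and a real decoder would derive the allowed set from a parser state at every step.

```python
import math

def softmax(logits):
    """Convert logits to probabilities; -inf logits get probability 0."""
    finite = [x for x in logits if x != -math.inf]
    peak = max(finite)
    exps = [0.0 if x == -math.inf else math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def constrain(logits, vocab, allowed):
    """Mask every token the grammar forbids by setting its logit to -inf."""
    return [x if tok in allowed else -math.inf for x, tok in zip(logits, vocab)]

# Hypothetical vocabulary and logits: the model "wants" to start a refusal.
VOCAB = ["I", "Sorry", "cannot", "def", "import", "("]
LOGITS = [2.0, 3.5, 1.0, 0.5, 0.2, -1.0]

# Assumption: a toy code grammar that only permits these tokens at position 0.
ALLOWED = {"def", "import"}

unconstrained = dict(zip(VOCAB, softmax(LOGITS)))
constrained = dict(zip(VOCAB, softmax(constrain(LOGITS, VOCAB, ALLOWED))))

top_free = max(unconstrained, key=unconstrained.get)  # refusal token wins
top_gcd = max(constrained, key=constrained.get)       # only code tokens remain

print("unconstrained top token:", top_free)
print("constrained top token:  ", top_gcd)
print("P(Sorry) under the grammar:", constrained["Sorry"])
```

Without the mask, the refusal token has the highest probability. With it, every natural-language token drops to exactly zero and the remaining mass is renormalized over code tokens, so the model must begin emitting code no matter how strongly it preferred to refuse.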

The defense, CodeShield, works by teaching the model to produce honeypot code under GCD: output that is syntactically valid, structurally diverse across grammar variations, and semantically inert. The malicious request goes unimplemented. Because the honeypot code varies structurally, an attacker cannot simply tighten the grammar to suppress it. When natural language output is available, the model still issues a conventional refusal. CodeShield preserves both modes without sacrificing benign utility.
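
To make the three honeypot properties concrete, here is a rough sketch of how one might check candidate outputs for them. It is an illustration only, not the CodeShield training procedure: the sample snippets are invented, and "inert" is approximated by a crude assumption (the code parses and contains no imports or function calls), which is far weaker than a real semantic analysis.

```python
import ast

# Hypothetical honeypot outputs: valid Python that implements nothing.
HONEYPOTS = [
    "def run(target):\n    return None\n",
    "class Task:\n    def execute(self):\n        pass\n",
    "values = []\nfor item in values:\n    pass\n",
]

def is_inert(source):
    """Crude proxy: the code parses and has no imports or calls."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    banned = (ast.Import, ast.ImportFrom, ast.Call)
    return not any(isinstance(node, banned) for node in ast.walk(tree))

def shape(source):
    """Structural fingerprint: the sequence of AST node types."""
    return tuple(type(node).__name__ for node in ast.walk(ast.parse(source)))

all_inert = all(is_inert(src) for src in HONEYPOTS)
distinct_shapes = len({shape(src) for src in HONEYPOTS})

print("all inert:", all_inert)
print("distinct structures:", distinct_shapes)
```

The point of the structural fingerprint is the one made above: if every honeypot shared a single shape, an attacker could write a grammar that excludes that shape. Varied structure makes that kind of tightening much harder.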

Across 10 LLMs and 4 benchmarks, CodeSpear lifts attack success rate by more than 30 percentage points on average over representative jailbreak baselines. That is not a marginal improvement over prior methods. CodeShield recovers safety under CodeSpear while keeping benign code generation intact. For security engineers and ML platform teams deploying structured output pipelines, the takeaway is direct: any grammar-constrained code generation endpoint is a potential attack surface, and existing alignment is not sufficient protection.

We're thinking: The structural logic here is more alarming than the headline numbers. GCD is not an exotic research technique. It is embedded in production tooling, constrained decoding libraries, and code-generation APIs that teams ship today without treating them as security-relevant components. The attack requires no prompt engineering skill, only the knowledge that the grammar constraint exists. That shifts the threat model: the relevant adversary is not a jailbreak researcher but anyone who reads the API documentation. CodeShield is a meaningful mitigation, but it requires fine-tuning each deployed model, which means teams using off-the-shelf models through third-party APIs have no immediate recourse beyond disabling GCD for sensitive endpoints entirely.

Key takeaways:

Grammar-constrained decoding bypasses LLM safety alignment by making refusal tokens structurally unreachable within the enforced grammar, not by defeating alignment logic directly.
CodeSpear raises attack success rate by more than 30 percentage points on average across 10 models and 4 benchmarks; CodeShield restores safety, but it requires model-level fine-tuning, and its behavior under grammar constraints outside the evaluated set remains untested.
Teams shipping GCD-enabled code generation endpoints should audit those endpoints as attack surfaces and evaluate whether CodeShield fine-tuning is feasible for their deployed models, or restrict grammar-constrained generation to sandboxed, non-sensitive contexts.
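
The audit recommended in the last takeaway could start as a simple policy check over an endpoint inventory. The sketch below is one possible shape for it, under stated assumptions: the `Endpoint` fields, the endpoint names, and the rule itself are hypothetical and would need to match how a given team actually classifies its services.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Endpoint:
    name: str
    sensitive: bool     # handles requests where malicious code would be harmful
    sandboxed: bool     # generated code runs only in an isolated environment
    shield_tuned: bool  # hypothetical flag: model has CodeShield-style fine-tuning

def may_enable_gcd(ep: Endpoint) -> bool:
    """Assumed policy: allow GCD if the model is defended, or if the
    endpoint is both sandboxed and non-sensitive."""
    if ep.shield_tuned:
        # Assumes the fine-tuning was validated for this deployment.
        return True
    return ep.sandboxed and not ep.sensitive

# Hypothetical inventory for illustration.
ENDPOINTS = [
    Endpoint("public-codegen", sensitive=True, sandboxed=False, shield_tuned=False),
    Endpoint("internal-json-tools", sensitive=False, sandboxed=True, shield_tuned=False),
    Endpoint("defended-codegen", sensitive=True, sandboxed=False, shield_tuned=True),
]

audit = {ep.name: may_enable_gcd(ep) for ep in ENDPOINTS}
for name, allowed in audit.items():
    print(f"{name}: GCD {'allowed' if allowed else 'disabled'}")
```

A check like this does not make any endpoint safe by itself. It only forces the decision about grammar-constrained generation to be made explicitly for each endpoint rather than inherited as a default.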

Source: Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code