Two Tokens Fix Hidden-State Recurrence: SWITCH Makes Latent Reasoning RL-Trainable

SWITCH adds discrete boundary tokens to latent chain-of-thought, making hidden-state recurrence compatible with standard on-policy RL and causally interpretable for the first time.

Latent chain-of-thought has always carried a seductive premise: compress reasoning into continuous hidden states, skip the verbose token trails, think faster. The catch is that hidden-state recurrence breaks standard on-policy reinforcement learning. The policy ratio that GRPO and its relatives depend on compares the probability of each sampled token under the new and old policies, and it becomes undefined when reasoning steps are continuous hidden states rather than sampled tokens. Teams pursuing latent reasoning have therefore had to build custom optimization scaffolding around a fundamentally awkward training target.

SWITCH sidesteps the entire problem with a minimal structural change. The model emits a single token, <swi>, to enter latent mode and a closing token, </swi>, to exit. Because those boundaries are ordinary discrete tokens sampled from the vocabulary, every decision point in the sequence has a well-defined probability for the GRPO policy ratio. The latent block between them can involve arbitrary hidden-state recurrence, but the optimizer sees clean entry and exit anchors whose probabilities it can compare across policy updates. Think of it as giving the RL algorithm two handholds on an otherwise smooth wall: the computation inside the block stays continuous and compressed, but the training signal has a place to grip.
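To make the role of the boundary tokens concrete, here is a minimal Python sketch of a GRPO-style clipped objective that is computed only at discrete positions. It is an illustration under stated assumptions, not the paper's Switch-GRPO: the function names, the is_discrete mask, and the choice to give latent positions no ratio term at all are hypothetical, and the KL penalty that GRPO normally includes is omitted for brevity.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def clipped_objective(new_logps, old_logps, is_discrete, advantage, eps=0.2):
    """Mean clipped surrogate over the discrete positions of one sequence.

    new_logps / old_logps: per-position log-probabilities of the sampled
    token under the new and old policy (entries at latent positions are
    ignored and may be None).
    is_discrete: True for ordinary tokens, including <swi> and </swi>;
    False for latent hidden-state steps (hypothetical mask).
    """
    terms = []
    for lp_new, lp_old, discrete in zip(new_logps, old_logps, is_discrete):
        if not discrete:
            # Assumption: a latent step samples no token, so it has no ratio.
            continue
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms) if terms else 0.0
```

In this sketch the sequence [<swi>, latent, latent, </swi>, answer] would contribute three ratio terms, one per discrete token, while the two latent steps influence the objective only through the probabilities of the tokens that follow them.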

The same two tokens do double duty as mechanistic probes. Because <swi> and </swi> are discrete and positionally fixed, researchers can run causal interventions directly on them, patching activations, ablating the block, and tracing what changes downstream. That is not a bonus feature. It is the same design choice, reused. A visible-to-latent curriculum trains the model progressively, starting from explicit chain-of-thought and gradually shifting computation into the latent block, with Switch-GRPO propagating gradients through the recurrent steps throughout.
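One way a visible-to-latent curriculum could be staged is sketched below in Python. The schedule shown here (each stage moves one more leading reasoning step into the latent block) is an assumption for illustration, not the schedule the SWITCH authors describe, and build_stage_example, latents_per_step, and the "<latent>" placeholder are hypothetical names.

```python
def build_stage_example(question, steps, answer, stage, latents_per_step=1):
    """Return the training sequence for one curriculum stage.

    stage = 0 keeps every reasoning step visible. Each later stage moves one
    more leading step into the <swi> ... </swi> block (assumed schedule).
    """
    hidden = min(stage, len(steps))
    seq = [question]
    if hidden > 0:
        # "<latent>" is a hypothetical placeholder for one hidden-state step.
        seq += ["<swi>"] + ["<latent>"] * (hidden * latents_per_step) + ["</swi>"]
    seq += steps[hidden:]  # steps that are still written out as text
    seq.append(answer)
    return seq
```

With two visible steps, stage 0 yields the fully explicit chain, stage 1 replaces the first step with a one-slot latent block, and any stage of 2 or more leaves only the latent block between the question and the answer.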

The mechanistic analysis produces three findings worth separating out. First, <swi> is a learned switching policy, not a stylistic artifact: ablating it degrades performance in ways that confirm the token carries genuine decisional weight. Second, the latent computation between the anchors is causally important and problem-specific, not an inert placeholder that the model ignores. Third, that computation concentrates at a single hidden-state transition on entry, meaning the model does most of its latent work immediately after crossing the boundary rather than spreading it across the full block.

SWITCH consistently outperforms prior hidden-state-recurrence approaches at comparable scale. For teams building or evaluating reasoning models, the takeaway is direct: latent chain-of-thought no longer requires bespoke training infrastructure to optimize with RL.

We're thinking: The interpretability angle here deserves more attention than it typically gets in efficiency-focused latent reasoning work. We now have a framework where the "invisible thinking" is not just trainable but actually inspectable: you can probe what the model encodes at the boundary, intervene on it, and measure the causal effect on outputs. That changes the risk calculus for deploying latent reasoning in production. Teams that previously avoided hidden-state approaches because they couldn't audit the reasoning process now have a concrete foothold. The more contrarian read is that concentrating computation at a single hidden-state transition on entry might become a fragility point at scale, and the curriculum design will matter enormously for how well that concentration generalizes across problem types.

Key takeaways:

Two discrete boundary tokens (<swi> / </swi>) make hidden-state recurrence compatible with standard GRPO-based on-policy RL and expose latent steps to causal probing, with no custom optimizer required.
SWITCH outperforms prior hidden-state-recurrence latent reasoning methods at similar scale. Mechanistic analysis confirms that the entry token acts as a learned switching policy and that latent computation concentrates at the first hidden-state transition rather than spreading across the full block.
Teams building reasoning models with RL fine-tuning should evaluate SWITCH's boundary-token design as a drop-in path to latent chain-of-thought that keeps standard on-policy training pipelines intact.

Source: Demystifying Hidden-State Recurrence: SWITCH