LoRA Isn't the Default for Hybrid Models Anymore
For hybrid models that combine recurrence and attention, tuning the recurrent layers' initial hidden state matches or beats standard LoRA on code tasks: it outperforms LoRA by +10.8 percentage points on HumanEval pass@1, raises Qwen3.5-4B's greedy pass@1 by +23.6 pp, and ties LoRA on FalconH1-7B, all with zero inference overhead and no weight merging at deployment. The method suits narrow, data-scarce problems on hybrids such as Qwen3.5-4B and FalconH1-7B. It does not apply to pure-attention transformers, and its gains did not transfer to text-to-SQL.
Fine-tuning a code model typically means LoRA (Low-Rank Adaptation), which injects trainable rank-decomposition matrices into the attention weights, keeps everything else frozen, and merges the result into the base weights for deployment. The implicit assumption is that adapting weights is the only effective route. S0 tuning leaves the weights untouched. Instead, it optimizes a single initial state matrix per recurrent layer: the hidden state the model starts from before it reads its first token. On roughly 48 execution-verified HumanEval training solutions (a dataset small enough to fit in a spreadsheet), S0 tuning outperforms LoRA by +10.8 percentage points (p < 0.001) on HumanEval pass@1.
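The contrast between the two approaches can be written side by side. The notation here is illustrative and not taken from the source: W is a frozen weight matrix, B and A are LoRA's low-rank factors of rank r with scaling factor α, f_θ stands for a generic recurrent update with frozen parameters θ, and S_0 is the trainable initial state of one recurrent layer.

```latex
% LoRA: learn low-rank factors B and A, then merge them into the frozen weight
W' = W + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}

% S0 tuning: \theta stays frozen; only the starting state of the recurrence is learned
h_t = f_\theta(h_{t-1}, x_t), \qquad h_0 = S_0 \quad \text{(trainable)}
```

In the first line the learned update has to be folded into W before serving. In the second there is nothing to fold in: the learned S_0 simply replaces whatever starting state the layer would otherwise use.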
This mechanism applies specifically to hybrid recurrent-attention architectures, such as Qwen3.5-4B (GatedDeltaNet hybrid) and FalconH1-7B (Mamba-2 hybrid), which feature recurrent layers alongside attention layers. These recurrent layers maintain a compressed state that propagates information across the sequence. S0 tuning treats that initial state as a learnable prior — a soft, persistent context baked into the model before inference — without altering any weight matrix. On Qwen3.5-4B, greedy pass@1 improves by +23.6 ± 1.7 pp across 10 seeds. On FalconH1-7B, S0 reaches 71.8% ± 1.3 compared to LoRA's 71.4% ± 2.4 (3 seeds). While statistically indistinguishable there, S0 offers a concrete operational advantage: it requires no weight merging step at deployment. Cross-domain transfer holds on math reasoning — MATH-500 gains +4.8 pp (p = 0.00002) and GSM8K gains +2.8 pp (p = 0.0003) after tuning only on code. However, it does not transfer to the Spider text-to-SQL benchmark, which suggests the learned initial state encodes general reasoning posture more than task-specific syntax.
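The idea of training only the starting state can be sketched with a toy example. The code below is not the paper's implementation and does not reproduce GatedDeltaNet or Mamba-2. It assumes a simplified linear recurrence with a matrix state, `S_t = decay * S_{t-1} + outer(k_t, v_t)`, read out as `o_t = S_t^T q_t`. All names (`run_layer`, `tune_initial_state`, `make_toy_problem`) and the synthetic data are hypothetical. The keys, values, queries, and decay play the role of the frozen model, and gradient descent updates `S0` alone.

```python
import random


def run_layer(S0, keys, values, queries, decay):
    """Toy recurrent layer: S_t = decay * S_{t-1} + outer(k_t, v_t); o_t = S_t^T q_t."""
    d_k, d_v = len(S0), len(S0[0])
    S = [row[:] for row in S0]  # copy so the caller's S0 is not mutated
    outputs = []
    for k, v, q in zip(keys, values, queries):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = decay * S[i][j] + k[i] * v[j]
        outputs.append([sum(S[i][j] * q[i] for i in range(d_k)) for j in range(d_v)])
    return outputs


def loss_and_grad(S0, keys, values, queries, targets, decay):
    """Squared-error loss and its gradient with respect to S0 only."""
    d_k, d_v = len(S0), len(S0[0])
    outputs = run_layer(S0, keys, values, queries, decay)
    loss = 0.0
    grad = [[0.0] * d_v for _ in range(d_k)]
    for t, (o, y, q) in enumerate(zip(outputs, targets, queries), start=1):
        scale = decay ** t  # S0 enters S_t as decay**t * S0
        for j in range(d_v):
            err = o[j] - y[j]
            loss += 0.5 * err * err
            for i in range(d_k):
                grad[i][j] += scale * q[i] * err
    return loss, grad


def tune_initial_state(keys, values, queries, targets, decay, steps=200, lr=0.05):
    """Plain gradient descent on S0; every other quantity stays frozen."""
    d_k, d_v = len(keys[0]), len(values[0])
    S0 = [[0.0] * d_v for _ in range(d_k)]  # assumed default: a zero starting state
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(S0, keys, values, queries, targets, decay)
        history.append(loss)
        for i in range(d_k):
            for j in range(d_v):
                S0[i][j] -= lr * grad[i][j]
    return S0, history


def make_toy_problem(seed=0, d_k=3, d_v=2, T=6, decay=0.9):
    """Synthetic data: targets come from a hidden starting state we try to recover."""
    rng = random.Random(seed)

    def vec(n):
        return [rng.uniform(-1, 1) for _ in range(n)]

    keys = [vec(d_k) for _ in range(T)]
    values = [vec(d_v) for _ in range(T)]
    queries = [vec(d_k) for _ in range(T)]
    hidden_S0 = [vec(d_v) for _ in range(d_k)]
    targets = run_layer(hidden_S0, keys, values, queries, decay)
    return keys, values, queries, targets, decay


keys, values, queries, targets, decay = make_toy_problem()
S0, history = tune_initial_state(keys, values, queries, targets, decay)
```

Here `history` records the loss at each step, so comparing `history[0]` with `history[-1]` shows how far the tuned starting state moved the outputs toward the targets. At deployment the only artifact is `S0` itself, which is loaded as the layer's starting state. Nothing is merged into a weight matrix.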
The method's hard limitation is scope: the results come from only two hybrid models. Pure-attention transformers have no recurrent state to optimize, so S0 tuning does not apply to them. The failure to transfer to text-to-SQL also shows that the initial state is not a universal adapter and that domain distance matters. Still, for teams deploying hybrid models on narrow, high-value tasks with minimal labeled data, S0 tuning has concrete advantages over LoRA: fewer parameters to tune, zero inference overhead, no post-training merge, and competitive or superior accuracy from roughly 48 examples.
Key takeaways:
- Optimizing the recurrent layer's initial hidden state — not any weight matrix — creates a learned prior that shifts model behavior before the first token is processed, with zero added inference cost.
- A +23.6 pp pass@1 gain on Qwen3.5-4B from roughly 48 training examples, with cross-domain transfer to math but none to text-to-SQL, suggests the initial state encodes general reasoning posture rather than task-specific patterns.
- Teams fine-tuning hybrid recurrent-attention models (GatedDeltaNet, Mamba-2) for code or reasoning tasks should benchmark S0 tuning against LoRA, particularly in low-data, latency-sensitive deployments, rather than automatically choosing weight adaptation.
Source: S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models