CoT Fine-Tuning Quietly Destroys Long-Context Recall in Hybrid LLMs

Chain-of-thought SFT collapses Needle-In-A-Haystack retrieval in hybrid linear-attention models, and a training-free QK weight restore fixes it.

Fine-tuning for reasoning is supposed to make a model more capable. In hybrid linear-attention models, it does, but only in one direction. Chain-of-thought supervised fine-tuning systematically breaks long-range recall, and the damage compounds as context windows grow.

The mechanism is specific. CoT-SFT training data is dense with short, sequential reasoning steps: premise, inference, conclusion, repeat. That pattern biases attention gradients toward local token interactions. The query-key projections ($W_Q$ and $W_K$) responsible for routing attention across long distances get quietly overwritten to favor nearby context. Full-attention transformers can partially compensate through redundancy across layers; hybrid models with linear-attention components cannot. Their long-range routing lives almost entirely in those projection weights, so when CoT-SFT shifts them, the long-context signal has nowhere to fall back to.

QK-Restore addresses this without any additional training. After CoT-SFT completes, the method discards only $W_Q$ and $W_K$, replacing them with the pre-SFT versions while keeping every other updated parameter: feed-forward weights, value projections, layer norms, everything that carries the reasoning improvements. A Procrustes-aligned variant goes one step further, rotating the restored weights into the post-SFT parameter space so that routing preservation and reasoning adaptation are balanced rather than traded off against each other.

HypeNet-9B on NIAH-S2@256K drops from 67.2% to 9.4% after CoT-SFT. QK-Restore on HypeNet-5B brings S3@256K from 65.4% up to 76.4%, exceeding the pre-SFT baseline, while reasoning scores hold. The pattern holds across HypeNet and Jet-Nemotron architectures. For teams deploying hybrid linear-attention models with CoT fine-tuning in their pipeline, the takeaway is direct: check your long-context retrieval benchmarks after every SFT run, and treat QK-Restore as a zero-cost recovery step before any production deployment.

We're thinking: The deeper issue here is that CoT-SFT degradation is invisible unless you specifically test for it. A team fine-tuning HypeNet for reasoning would see reasoning scores improve, declare success, and ship, never knowing that long-context retrieval had collapsed from 67% to 9%. We think this points to a gap in standard fine-tuning eval suites: Needle-In-A-Haystack at multiple context lengths should be a default regression check after any SFT, not an optional diagnostic. The QK-Restore fix is elegant precisely because it confirms the failure is localized, but the more actionable lesson is that hybrid architectures have weight-level capability dependencies that standard benchmarks do not surface on their own.

Key takeaways:

CoT-SFT biases $W_Q$ and $W_K$ gradients toward short-range patterns, destroying the long-range routing that hybrid linear-attention models depend on for extended context retrieval.
HypeNet-9B NIAH-S2@256K falls from 67.2% to 9.4% post-SFT; QK-Restore recovers and exceeds baselines at zero training cost, though results are currently validated on HypeNet and Jet-Nemotron only.
Teams fine-tuning hybrid linear-attention models for reasoning should add NIAH retrieval at maximum context length to their post-SFT eval suite and run QK-Restore before any deployment.

Source: Attention Amnesia in Hybrid LLMs