RLVR Fine-Tuning Is Geometrically Wasteful: Rank-1 Extrapolation Matches Full Training

RELEX shows RLVR weight trajectories are rank-1 and near-linear, letting teams extrapolate full-run checkpoints from just 15% of training steps.

The standard assumption in reasoning fine-tuning is that reinforcement learning with verifiable rewards requires sustained training to accumulate meaningful parameter updates. The geometry of those updates has been mostly ignored. It turns out the geometry is almost embarrassingly simple.

RLVR training modifies model weights along a trajectory that is, in practice, rank-1. The full matrix of parameter deltas across training steps collapses onto a single dominant direction, and the model's displacement along that direction grows near-linearly with the number of steps. The approximation is tight where it matters: a rank-1 projection of the weight delta captures most of the downstream performance, and adding higher-rank components yields no measurable gain in extrapolation quality.
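One way to make the rank-1 claim above concrete is to ask how much of the trajectory's total energy the top singular value accounts for. The sketch below is illustrative only: the function names `rank1_energy` and `synthetic_deltas`, the layout of the delta matrix (one flattened checkpoint delta per row), and the energy ratio itself are assumptions made for this example, not the diagnostic the paper necessarily reports.

```python
import numpy as np


def rank1_energy(deltas: np.ndarray) -> float:
    """Fraction of squared Frobenius norm carried by the top singular value.

    deltas: shape (num_checkpoints, num_params); row t is assumed to hold
    the flattened difference (weights at checkpoint t) - (base weights).
    A value near 1.0 means the trajectory is close to rank-1.
    """
    singular_values = np.linalg.svd(deltas, compute_uv=False)
    return float(singular_values[0] ** 2 / np.sum(singular_values ** 2))


def synthetic_deltas(num_steps=20, num_params=500, noise=0.01, seed=0):
    """Toy trajectory (hypothetical data): linear drift along one fixed
    direction plus small isotropic Gaussian noise."""
    rng = np.random.default_rng(seed)
    direction = rng.standard_normal(num_params)
    direction /= np.linalg.norm(direction)
    steps = np.arange(1, num_steps + 1, dtype=float)
    clean = np.outer(steps, direction)  # exactly rank-1, linear in step
    return clean + noise * rng.standard_normal((num_steps, num_params))


if __name__ == "__main__":
    print("rank-1 energy fraction:", rank1_energy(synthetic_deltas()))
```

On real checkpoints the same ratio would more plausibly be computed per weight matrix than on one flattened vector for the whole model; that choice is left open here.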

RELEX (Reinforcement Learning Extrapolation) turns this observation into a practical method. Given a short prefix of RLVR training, it estimates the rank-1 subspace via singular value decomposition of the observed weight deltas, then fits a simple linear regression over the scalar magnitudes across steps. No learned model is required. Extrapolating to a future checkpoint is matrix arithmetic: take the estimated subspace direction, scale it by the linearly projected magnitude, and add the result to the base weights. The method also functions as a denoising filter. Stochastic gradient noise that accumulates during real optimization is discarded when updates are projected onto the rank-1 subspace, which is part of why extrapolated checkpoints sometimes outperform the checkpoints that continued training would have produced.
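The procedure described above (SVD of the observed deltas, a linear fit over the scalar magnitudes, then scale and add) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `relex_extrapolate`, the flattened-vector representation of weights, and the use of `np.polyfit` for the regression are all choices made for this example.

```python
import numpy as np


def relex_extrapolate(base, checkpoints, steps, target_step):
    """Rank-1 linear extrapolation of a weight trajectory (illustrative sketch).

    base:        (P,) flattened base weights (assumed representation)
    checkpoints: (T, P) flattened weights observed at the given steps
    steps:       (T,) training-step numbers of the observed checkpoints
    target_step: step number to extrapolate to
    """
    steps = np.asarray(steps, dtype=float)
    deltas = checkpoints - base  # (T, P) observed weight deltas

    # The top right-singular vector is the dominant direction in parameter space.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]

    # Scalar magnitude of each observed delta along that direction.
    # Components orthogonal to the direction are dropped here, which is
    # the denoising effect of the projection.
    magnitudes = deltas @ direction

    # Ordinary least-squares line: magnitude ~ slope * step + intercept.
    slope, intercept = np.polyfit(steps, magnitudes, deg=1)
    projected = slope * target_step + intercept

    # The sign ambiguity of the singular vector cancels, because the
    # magnitudes were measured against the same vector.
    return base + projected * direction
```

A call such as `relex_extrapolate(base, observed, steps=[10, 20, 30, 40, 50], target_step=1000)` would correspond to the 50-to-1000-step setting. In practice one would likely apply this to each weight matrix separately and reshape the result back into the model's state dict; that plumbing is omitted.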

Across Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base, RELEX checkpoints match or exceed full RLVR performance on both in-domain and out-of-domain benchmarks using as little as 15% of total training steps as the observation window. The extrapolation range is not modest: observing only the first 50 steps and extrapolating to 1000 steps, a 20x projection, still yields continued improvement. For teams running reasoning fine-tuning at any scale, the takeaway is direct: a 15% observation window implies roughly a 7x reduction in RLVR training compute without sacrificing final model quality, and the 50-to-1000-step result suggests there may be more headroom.

We're thinking: The denoising result is more consequential than the efficiency headline. If projecting onto a rank-1 subspace discards noise that would otherwise degrade performance, then a meaningful fraction of full RLVR training is spent fighting its own stochastic updates rather than learning. Labs spending millions of GPU-hours on reasoning fine-tuning may be paying primarily for noise cancellation that a linear projection can replicate for free. The practical ceiling is unknown: RELEX is validated on models up to 8B parameters, and whether the rank-1 geometry holds at 70B or 400B is an open question worth answering before drawing budget conclusions at frontier scale.

Key takeaways:

RLVR weight trajectories are geometrically rank-1 and near-linear in magnitude across training steps, meaning the full training run lives in a one-dimensional subspace that can be identified early and extrapolated with linear regression.
RELEX matches or exceeds full RLVR performance across three Qwen models using 15% of training steps as input, with validated extrapolation up to 20x beyond the observation window; the main caveat is that experiments top out at 8B parameters, leaving frontier-scale behavior unconfirmed.
Teams running RLVR fine-tuning for math or reasoning tasks should test RELEX as a checkpoint-generation layer before committing to full training runs, particularly when iterating on base models or reward functions where full runs would otherwise be required for each configuration.

Source: You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories