Most Tokens in a Correct Response Are Getting the Wrong Credit Signal

DelTA shows that RLVR's policy-gradient update acts as an implicit token-level discriminator, then fixes the distortion this creates, gaining 3.26 average points on math benchmarks.

Response-level rewards feel like a clean abstraction: the model produces a chain of thought, gets a score, updates. The assumption buried inside that abstraction is that the reward signal distributes sensibly across the tokens that produced it. DelTA breaks that assumption open and shows the distribution is actively distorted.

The mechanism is geometric. A standard policy-gradient update for RLVR implicitly constructs two centroids in gradient space: one from tokens in high-reward responses (weighted by positive advantages), one from tokens in low-reward responses (weighted by negative advantages). The update direction is the vector pointing from the negative centroid to the positive one. That direction is, structurally, a linear discriminator over token-gradient vectors. It decides which token probabilities go up and which go down. The problem is that both centroids are dominated by high-frequency, low-information tokens: formatting markers, punctuation, connective phrases that appear in every response regardless of whether it's correct. Those shared tokens pull both centroids toward the same region of gradient space. The resulting discriminator direction is dull. It amplifies common patterns and dilutes the sparse, contrastive directions that actually separate good reasoning from bad.
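The geometry can be made concrete with a toy NumPy sketch. Everything in it is an illustrative assumption rather than the paper's setup: the 8-dimensional "gradients", the 90/10 split between shared and side-specific tokens, and the unit advantages. It also assumes both sides carry equal total advantage weight, in which case the advantage-weighted sum of token gradients is exactly a scaled difference of the two centroids.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy gradient dimension (assumption)

# Hypothetical basis directions for three kinds of token gradients.
shared = np.eye(DIM)[0]        # formatting / punctuation tokens, present on both sides
pos_specific = np.eye(DIM)[1]  # tokens seen mainly in high-reward responses
neg_specific = np.eye(DIM)[2]  # tokens seen mainly in low-reward responses

def make_side(specific, n_shared=90, n_specific=10):
    """Stack token-gradient rows for one side: mostly shared, a few specific."""
    base = np.vstack([np.tile(shared, (n_shared, 1)),
                      np.tile(specific, (n_specific, 1))])
    return base + 0.01 * rng.standard_normal(base.shape)

def centroid(grads, weights):
    """Weighted mean of token-gradient rows."""
    return (weights[:, None] * grads).sum(axis=0) / weights.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pos_grads, neg_grads = make_side(pos_specific), make_side(neg_specific)
pos_w = np.ones(len(pos_grads))  # |advantage| carried by each positive-side token
neg_w = np.ones(len(neg_grads))  # |advantage| carried by each negative-side token

c_pos = centroid(pos_grads, pos_w)
c_neg = centroid(neg_grads, neg_w)
direction = c_pos - c_neg  # points from the negative centroid to the positive one

# Advantage-weighted sum of token gradients, as in a plain policy-gradient step.
pg_update = ((pos_w[:, None] * pos_grads).sum(axis=0)
             - (neg_w[:, None] * neg_grads).sum(axis=0))

print("cosine(c_pos, c_neg):", round(cosine(c_pos, c_neg), 3))
print("cosine(pg_update, direction):", round(cosine(pg_update, direction), 3))
print("|direction| / |c_pos|:",
      round(float(np.linalg.norm(direction) / np.linalg.norm(c_pos)), 3))
```

In this toy, the two centroids are almost parallel because the shared tokens dominate both, and the contrastive direction is only a small fraction of either centroid's length. Real token gradients are far higher-dimensional and noisier, so the shared component would not cancel as neatly as it does here.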

DelTA estimates per-token coefficients that amplify side-specific gradient directions and downweight shared or weakly discriminative ones. The coefficients reweight the standard RLVR surrogate objective so that the effective positive and negative centroids are more contrastive: tokens that appear disproportionately in correct responses get more influence over the update direction, and tokens that appear uniformly across correct and incorrect responses get less. No new reward model is required. No token-level annotation. The reweighting operates directly on the gradient geometry, which means it slots into existing RLVR pipelines without architectural changes.
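The effect of reweighting on the centroids can be sketched on the same kind of toy data. The coefficient rule below is a hypothetical stand-in, not DelTA's estimator, which this summary does not spell out: `toy_coefficients` simply downweights tokens whose gradient aligns with the direction the two centroids share. The data layout (90 shared and 10 side-specific tokens per side, 8 dimensions) is likewise assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy gradient dimension (assumption)
shared, pos_specific, neg_specific = np.eye(DIM)[:3]

def make_side(specific, n_shared=90, n_specific=10):
    """Stack token-gradient rows for one side: mostly shared, a few specific."""
    base = np.vstack([np.tile(shared, (n_shared, 1)),
                      np.tile(specific, (n_specific, 1))])
    return base + 0.01 * rng.standard_normal(base.shape)

def centroid(grads, weights):
    """Weighted mean of token-gradient rows."""
    return (weights[:, None] * grads).sum(axis=0) / weights.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_coefficients(grads, shared_axis):
    """Hypothetical heuristic: coefficient = 1 - |cos(token gradient, shared axis)|."""
    cos = grads @ shared_axis / (
        np.linalg.norm(grads, axis=1) * np.linalg.norm(shared_axis))
    return np.clip(1.0 - np.abs(cos), 0.0, 1.0)

pos_grads, neg_grads = make_side(pos_specific), make_side(neg_specific)

# Baseline: every token counts equally toward its side's centroid.
c_pos = centroid(pos_grads, np.ones(len(pos_grads)))
c_neg = centroid(neg_grads, np.ones(len(neg_grads)))
before = cosine(c_pos, c_neg)

# Treat the sum of the two centroids as the direction both sides share.
shared_axis = c_pos + c_neg
coef_pos = toy_coefficients(pos_grads, shared_axis)
coef_neg = toy_coefficients(neg_grads, shared_axis)

# Reweighted centroids: shared tokens contribute little, specific tokens a lot.
after = cosine(centroid(pos_grads, coef_pos), centroid(neg_grads, coef_neg))

print("centroid cosine before reweighting:", round(before, 3))
print("centroid cosine after reweighting:", round(after, 3))
```

A lower cosine after reweighting means the effective positive and negative centroids point in more distinct directions, which is the qualitative behavior described above. How the coefficients are actually estimated, and how they enter the surrogate objective, should be taken from the paper rather than from this sketch.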

On seven mathematical benchmarks, DelTA beats the strongest same-scale baselines by 3.26 average points on Qwen3-8B-Base and 2.62 average points on Qwen3-14B-Base. The gains hold on code generation, transfer to a different backbone, and survive out-of-domain evaluation. For teams running RLVR fine-tuning on reasoning tasks, the takeaway is direct: the reward signal you think you are applying to your model is being systematically diluted by formatting tokens before it ever reaches the weights that matter.

We're thinking: We find the discriminator framing more useful than the performance numbers. Most RLVR debugging today focuses on reward quality: is the verifier accurate, is the prompt well-formed, is the advantage estimate stable. DelTA suggests there is a failure mode that precedes all of those: even a perfect response-level reward gets smeared across tokens in a way that privileges surface patterns over reasoning structure. That means teams that have already invested in better verifiers or more careful prompt engineering may be leaving gains on the table, not because their reward signal is wrong, but because the gradient aggregation step is washing it out. The fix is cheap relative to the problem it solves, which makes the diagnostic lens here arguably more durable than any single benchmark number.

Key takeaways:

DelTA reframes the RLVR policy-gradient update as a linear discriminator over token-gradient vectors, then corrects the centroid distortion caused by high-frequency shared tokens by reweighting the surrogate objective with estimated per-token discriminativity coefficients.
Gains of 3.26 and 2.62 average points over the strongest same-scale baselines on seven math benchmarks (Qwen3-8B-Base and Qwen3-14B-Base, respectively), with generalization confirmed on code and out-of-domain tasks. The caveat is that evaluations are limited to Qwen3-class models and standard math/code verifiable-reward settings.
Teams doing RLVR fine-tuning for reasoning should audit whether formatting and connective tokens are dominating their gradient signal before scaling compute or improving verifiers, and treat DelTA-style token reweighting as a low-cost diagnostic and training intervention.

Source: DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards