§ Brief · Apr 18, 2026 · Issue 31 · Worth Reading

Reward models that explain themselves outperform those that just score

Reward models for image generation can now produce detailed written critiques instead of only numerical scores, which lets them guide image refinement rather than merely rank outputs. This makes AI image generators improvable without retraining: a critique-based feedback loop revises the prompt using the reward model's own reasoning about what went wrong, then regenerates.

Visual generation reward models have always been bottlenecked by the same design decision: collapsing human judgment into a single number and discarding the reasoning. That score tells a generator what to optimize but provides no information about why one output surpasses another. This results in a passive evaluator that can rank outputs but cannot guide revision.

RationalRewards changes the reward model's role by training it to produce structured, multi-dimensional critiques before emitting a score. It acts like an editor marking up a draft, and the markup, rather than a bare verdict, becomes the optimization signal. This unlocks two separate gains. At training time, the structured rationales give reinforcement learning (RL) fine-grained, interpretable reward signals instead of a scalar to chase blindly. At test time, a Generate-Critique-Refine loop uses the critique to rewrite the prompt and regenerate, improving output quality with zero parameter updates. The same model that scores also repairs.
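
The test-time loop can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the RationalRewards implementation: `generate`, `critique`, and `revise` are hypothetical callables standing in for an existing generator, a critique-producing reward model, and a prompt rewriter, and the `Critique` shape, `max_rounds`, and `target_score` are invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Critique:
    """Hypothetical critique shape: per-dimension notes plus an overall score."""
    notes: Dict[str, str]  # e.g. {"prompt_following": "missing: red"}
    score: float           # higher is better (assumed convention)


def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], Any],
    critique: Callable[[str, Any], Critique],
    revise: Callable[[str, Critique], str],
    max_rounds: int = 3,
    target_score: float = 0.9,
) -> Tuple[Any, str, List[float]]:
    """Return the best image, the prompt that produced it, and the score history.

    No model weights are touched: only the prompt changes between rounds.
    """
    best_image, best_prompt, best_score = None, prompt, float("-inf")
    history: List[float] = []
    current = prompt
    for _ in range(max_rounds):
        image = generate(current)
        # Assumption: judge against the ORIGINAL prompt so that revised
        # prompts cannot drift away from what the user asked for.
        result = critique(prompt, image)
        history.append(result.score)
        if result.score > best_score:
            best_image, best_prompt, best_score = image, current, result.score
        if result.score >= target_score:
            break
        # Use the written critique, not just the score, to rewrite the prompt.
        current = revise(current, result)
    return best_image, best_prompt, history
```

Keeping the best-scoring image rather than the last one guards against a revision that makes things worse; whether the actual system does this is not stated in the brief.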

The hard part is achieving this without expensive human rationale annotations. Preference-Anchored Rationalization (PARROT) solves this by recovering high-quality reasoning traces from existing preference data (pairs of ranked outputs, which already exist in abundance) rather than requiring annotators to write critiques from scratch. PARROT anchors the rationalization process to known preference orderings, so the generated explanations stay consistent with ground-truth human judgments instead of being plausible-sounding but incorrect post-hoc stories. The reward model learns to reason about quality in the way the preference data implies humans reason about it.
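
One simple way to picture the anchoring idea is a consistency filter: generate a rationale for each preference pair, and keep it only if its conclusion matches the human label. The sketch below is an assumption-laden illustration of that idea, not PARROT's actual procedure, which the brief does not spell out; `PreferencePair`, `Rationale`, and `rationalize` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    image_a: str    # placeholder for image data or a path
    image_b: str
    preferred: str  # ground-truth human label: "a" or "b"


@dataclass(frozen=True)
class Rationale:
    text: str     # written explanation of the comparison
    verdict: str  # which image the rationale concludes is better: "a" or "b"


def keep_consistent_rationales(
    pairs: Iterable[PreferencePair],
    rationalize: Callable[[PreferencePair], Rationale],
) -> List[Tuple[PreferencePair, Rationale]]:
    """Keep only rationales whose verdict agrees with the known preference."""
    kept = []
    for pair in pairs:
        rationale = rationalize(pair)
        # A rationale that argues for the image humans rejected is a
        # plausible-but-wrong story, so it is dropped from the training set.
        if rationale.verdict == pair.preferred:
            kept.append((pair, rationale))
    return kept
```

Under this framing, a quality dimension with sparse or noisy preference labels leaves few trustworthy anchors, which is one way to see why recovered rationales would be weaker there.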

Results in the full paper (beyond what the abstract reports) show gains at both training and test time across standard visual generation benchmarks, with the test-time refinement loop improving prompt-following and aesthetic quality without touching model weights. The limitation is real: rationale quality depends on how well PARROT can reconstruct reasoning from pairwise preferences alone. When preference data is sparse or noisy in a specific quality dimension, the recovered rationales for that dimension are less reliable, and the critique loop loses precision exactly where it would be most beneficial.

For teams building image generation pipelines, the test-time loop is the immediately deployable piece. A critique-based prompt revision step requires no retraining and no infrastructure changes; it acts as a wrapper around your existing generator. The training-time signal is the longer-term investment, relevant for teams fine-tuning generators with RL who currently use scalar aesthetic or CLIP (Contrastive Language-Image Pre-training) scores as reward.
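
For the training-time side, the practical difference from a scalar reward is that a structured critique can be both collapsed into a number for RL and inspected per dimension. The sketch below assumes the critique exposes numeric per-dimension scores in [0, 1]; the dimension names, weights, and function names are hypothetical and not taken from the paper.

```python
from typing import Mapping, Tuple


def aggregate_reward(
    dim_scores: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Weighted average over the dimensions that have a weight."""
    shared = [d for d in dim_scores if d in weights]
    total_weight = sum(weights[d] for d in shared)
    if total_weight <= 0:
        raise ValueError("no positively weighted dimensions to aggregate")
    return sum(dim_scores[d] * weights[d] for d in shared) / total_weight


def weakest_dimension(dim_scores: Mapping[str, float]) -> Tuple[str, float]:
    """Return the lowest-scoring dimension, i.e. what is dragging the reward down."""
    name = min(dim_scores, key=dim_scores.get)
    return name, dim_scores[name]
```

A scalar aesthetic or CLIP score gives only the equivalent of `aggregate_reward`; the per-dimension view is what makes the signal interpretable when a fine-tuning run stalls.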

Key takeaways:

  • Reward models trained to produce structured critiques before scoring serve as both fine-grained RL training signals and prompt-revision engines at inference, effectively offering one model with two optimization pathways.
  • PARROT recovers rationale supervision from existing preference pairs, removing the annotation bottleneck that has blocked this approach; rationale quality degrades where preference data is thin.
  • Teams running RL fine-tuning on image generators should evaluate replacing scalar reward models with critique-generating ones; teams not doing RL can still capture test-time gains from the Generate-Critique-Refine loop with no retraining.

Source: RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time