SFT Before RL Is Actively Hurting Your Multimodal Model

PRISM inserts a black-box on-policy distillation stage between SFT and RLVR, correcting distributional drift without white-box model access and lifting average accuracy by up to 6.0 points over the SFT-to-RLVR baseline.

The standard multimodal post-training recipe treats supervised fine-tuning (SFT) as a neutral warm-up: collect demonstrations, fine-tune, then hand the model off to reinforcement learning with verifiable rewards (RLVR). PRISM finds that assumption is wrong. SFT doesn't just fail to prepare a model for RL. It shifts the model's output distribution away from what RL needs to start from, and the damage compounds once RL begins.

The mechanism is most damaging in multimodal settings. In text-only models, distributional drift from SFT is a known friction point, but it's largely uniform. In multimodal models, perception errors and reasoning failures drift in distinct, independent patterns. A model that hallucinates a visual detail and a model that makes a flawed inference step are making different kinds of mistakes, but SFT treats both as the same distribution gap. When RL then tries to correct from that already-drifted starting point, it's correcting from a broken prior. The errors don't cancel. They compound.

PRISM adds a third stage between SFT and RLVR: an explicit distribution-alignment step built on on-policy distillation. The policy generates its own responses, and a Mixture-of-Experts (MoE) discriminator, with separate experts for perception and reasoning, evaluates them. That discriminator acts as an adversarial signal source, steering the policy back toward the supervision distribution. The key design choice is that the whole system runs as a black-box, response-level game: the discriminator never needs teacher logits or any other view inside the teacher. It only has to distinguish the policy's outputs from the target distribution's outputs, and its dedicated experts keep the perception and reasoning signals disentangled.
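
The response-level game described above can be illustrated with a toy sketch. This is not PRISM's implementation: the `Expert` and `TwoExpertDiscriminator` classes, the hand-made feature vectors, the logistic scorers, and the choice to average the two expert scores into one reward are all assumptions made for illustration. The sketch shows only the black-box property: the discriminator is trained on sampled responses (reduced here to feature vectors) from the policy and from the target distribution, and it never reads teacher logits. The policy update that would consume the reward is omitted.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class Expert:
    """Hypothetical logistic scorer standing in for one discriminator expert."""

    def __init__(self, dim: int) -> None:
        self.w = [0.0] * dim
        self.b = 0.0

    def prob_target(self, x: list[float]) -> float:
        # Probability that features x came from the target distribution.
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def update(self, x: list[float], label: float, lr: float) -> None:
        # One SGD step on binary cross-entropy (label 1 = target, 0 = policy).
        err = self.prob_target(x) - label
        self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= lr * err


class TwoExpertDiscriminator:
    """Separate experts for perception and reasoning features (assumed layout)."""

    def __init__(self, dim: int) -> None:
        self.perception = Expert(dim)
        self.reasoning = Expert(dim)

    def update(self, perc: list[float], reas: list[float], label: float, lr: float = 0.1) -> None:
        self.perception.update(perc, label, lr)
        self.reasoning.update(reas, label, lr)

    def reward(self, perc: list[float], reas: list[float]) -> float:
        # Assumption: average the two expert scores into one scalar reward.
        return 0.5 * (self.perception.prob_target(perc) + self.reasoning.prob_target(reas))


def sample_features(rng: random.Random, mean: float, dim: int) -> list[float]:
    # Toy stand-in for featurizing a sampled response; a real system would encode text and images.
    return [rng.gauss(mean, 0.5) for _ in range(dim)]


def train_toy_discriminator(steps: int = 300, dim: int = 2, seed: int = 0) -> TwoExpertDiscriminator:
    rng = random.Random(seed)
    disc = TwoExpertDiscriminator(dim)
    for _ in range(steps):
        # Toy data: target-distribution responses cluster at +1, drifted policy responses at -1.
        disc.update(sample_features(rng, 1.0, dim), sample_features(rng, 1.0, dim), label=1.0)
        disc.update(sample_features(rng, -1.0, dim), sample_features(rng, -1.0, dim), label=0.0)
    return disc


if __name__ == "__main__":
    disc = train_toy_discriminator()
    print("reward, target-like response:", round(disc.reward([1.0, 1.0], [1.0, 1.0]), 3))
    print("reward, drifted response:", round(disc.reward([-1.0, -1.0], [-1.0, -1.0]), 3))
```

In this toy setup the reward is higher for responses that resemble the target distribution, and because each expert sees only its own features, a low score can be traced to perception or to reasoning. A policy-gradient step would then raise the probability of high-reward responses.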

The data side matters too. SFT initialization uses 1.26M public demonstrations, enough for broad coverage. But distribution alignment requires higher-fidelity supervision than public data provides. PRISM curates 113K additional demonstrations from Gemini 3 Flash, specifically targeting the hardest unsolved problems in the benchmark suite, with dense visual grounding and step-by-step reasoning chains. That curation decision reflects a principle worth noting: data quality requirements change at each stage. Volume works for initialization. Precision is what alignment needs.
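
One way to read "targeting the hardest unsolved problems" is as a filter over sampled attempts. The sketch below assumes, hypothetically, that a problem counts as unsolved when none of the current model's sampled attempts is correct. The source does not state PRISM's actual selection criterion, and `select_unsolved` and the attempt-log format are illustrative names.

```python
def select_unsolved(attempts: dict[str, list[bool]]) -> list[str]:
    """Return IDs of problems with at least one attempt and no correct attempt, sorted.

    `attempts` maps a problem ID to per-sample correctness flags
    (hypothetical log format; True means the sampled answer was verified correct).
    """
    return sorted(pid for pid, flags in attempts.items() if flags and not any(flags))


if __name__ == "__main__":
    log = {
        "chart-017": [False, False, False, False],  # never solved: keep
        "geom-042": [False, True, False, False],    # solved once: skip
        "ocr-003": [True, True, True, True],        # always solved: skip
    }
    print(select_unsolved(log))  # prints ['chart-017']
```

The selected problems would then go to the stronger teacher for grounded, step-by-step demonstrations; that step is omitted here.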

On Qwen3-VL, PRISM lifts average accuracy by +4.4 points on the 4B model and +6.0 points on the 8B model over the SFT-to-RLVR baseline, across GRPO, DAPO, and GSPO. The gains hold across diverse multimodal benchmarks, and the pipeline is algorithm-agnostic: swap the RL algorithm and the alignment stage still contributes. Code, data, and checkpoints are public. For teams running multimodal RL post-training pipelines, the takeaway is direct: the gap between your SFT checkpoint and your RL starting point is not a neutral handoff, and treating it as one is leaving measurable accuracy on the table.

We're thinking: The uncomfortable implication here is that the field has been measuring multimodal RL from a compromised starting point. If SFT drift systematically compounds perception and reasoning errors before RL even starts, then reported RL gains in multimodal settings may understate what RL can deliver, because the comparison point itself is contaminated. PRISM's black-box design is also worth attention beyond its accuracy numbers: requiring no teacher logit access means this alignment stage can be dropped into pipelines that use proprietary or closed teachers, which describes many production multimodal setups. The 113K high-fidelity demonstrations from Gemini 3 Flash are a separate, practical contribution, and we suspect that the curation methodology, targeting the hardest unsolved problems with grounded reasoning chains, may prove more reusable than the architecture itself.

Key takeaways:

PRISM inserts an on-policy distillation alignment stage between SFT and RLVR, using a MoE discriminator with separate perception and reasoning experts to produce disentangled corrective signals without requiring white-box teacher access.
On Qwen3-VL, PRISM delivers +4.4 and +6.0 average accuracy points over the SFT-to-RLVR baseline on the 4B and 8B models respectively, across three RL algorithms. Caveats: results cover a single model family, and reproducing the 113K alignment dataset requires a capable closed teacher.
Teams running multimodal RLVR post-training should audit their SFT-to-RL handoff as an active failure point and consider inserting an explicit distribution-alignment stage before RL begins.

Source: PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal RL