Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

Visual reasoning models can learn to route simple questions directly to answers instead of forcing every query through a lengthy multi-step reasoning chain. This adaptive approach matches full-chain accuracy while cutting compute on the majority of real-world questions, which turn out to be straightforward factual or perceptual tasks rather than complex reasoning problems.

Visual reasoning models waste compute answering easy questions the hard way

Longer reasoning chains feel like a safe default for visual question answering. More steps, more coverage, fewer errors. Systematic ablations on VRMs (Visual Reasoning Models) tell a different story: the majority of visual questions don't need full chain-of-thought reasoning, and forcing every query through the same long-form path burns compute without improving accuracy.

The cause is what the authors call Reasoning Path Redundancy. Standard VRMs run three cognitive operations on every query: visual perception, logical reasoning, and answer synthesis. Most visual questions need only one or two of them, such as perceiving the image or retrieving an answer directly. AVR (Adaptive Visual Reasoning) maps these operations onto three explicit response formats and trains the model to choose among them dynamically. Simple factual queries receive a Direct Answer. Perceptual questions (count the objects, identify the color) receive a Perception-Only Format. Only genuinely complex multi-step tasks trigger the Full Format.

The model learns this routing rather than following fixed rules. FS-GRPO (Few-Shot Group Relative Policy Optimization) is an adaptation of standard GRPO (Group Relative Policy Optimization), a reinforcement learning method that samples a group of outputs per prompt and scores each one relative to the rest of the group. FS-GRPO rewards correct answers that use appropriately short reasoning chains and penalizes length inflation on easy queries.
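To make the reward idea concrete, here is a minimal sketch of a group-relative advantage computation with a format-aware reward. It is not the paper's implementation: the `FORMAT_BONUS` values, the `Sample` type, and the `reward` function are hypothetical choices made for illustration. Only the general GRPO pattern (sample a group, normalize each reward against the group mean and standard deviation) is taken as given.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Hypothetical bonuses (not from the paper): a correct answer earns more
# when it was produced with a shorter response format.
FORMAT_BONUS = {"direct": 0.3, "perception_only": 0.15, "full": 0.0}


@dataclass
class Sample:
    fmt: str       # one of the keys in FORMAT_BONUS
    correct: bool  # whether the final answer matched the reference


def reward(sample: Sample) -> float:
    """Assumed reward: nothing for a wrong answer, 1.0 plus a format bonus otherwise."""
    if not sample.correct:
        return 0.0
    return 1.0 + FORMAT_BONUS[sample.fmt]


def group_relative_advantages(samples, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    rewards = [reward(s) for s in samples]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled responses to the same (easy) question.
group = [
    Sample("direct", True),
    Sample("full", True),
    Sample("direct", False),
    Sample("perception_only", True),
]
advantages = group_relative_advantages(group)
```

Under these assumed bonuses, a correct Direct Answer gets the largest advantage in the group, a correct Full Format response gets a smaller one, and a short but wrong answer is pushed down. That ordering is the property that discourages both length inflation and careless shortcuts.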

The savings come from the easy end of the query distribution. On standard visual reasoning benchmarks, AVR matches or beats full-chain baselines on accuracy while substantially cutting average response length on the easy-to-medium query tier, which makes up the bulk of real-world traffic. The limitation is scope: evaluations focus on static image benchmarks, and the routing behavior on ambiguous or adversarially framed questions has not been stress-tested. Whether the model correctly identifies a question that only appears simple remains an open question.

For teams running VRMs in production, the practical signal is direct: treating every query as a hard reasoning problem is a cost choice disguised as a quality choice. Routing at the model level, rather than in an external classifier layer, keeps the decision inside the model, where it has access to visual context that a pre-inference router lacks.

Key takeaways:

AVR decomposes visual reasoning into three cognitive functions and trains models via FS-GRPO to route each query to the shortest sufficient reasoning path, reducing length without sacrificing accuracy.
Most production visual query traffic sits in the perceptual and direct-answer tiers; full chain-of-thought is an exception rather than the default when the model is given a choice.
Teams deploying VRMs should evaluate per-query reasoning length distributions before assuming longer chains are necessary; adaptive routing at inference time may cut compute costs on the majority of traffic.
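One way to start on the last takeaway is to profile response lengths from existing logs. The sketch below assumes you already have a list of per-query response token counts. The function name `length_profile` and the `short_threshold` default of 64 tokens are hypothetical choices, not values from the paper.

```python
def length_profile(token_counts, short_threshold=64):
    """Summarize how response length is distributed across logged queries."""
    counts = sorted(token_counts)
    if not counts:
        raise ValueError("token_counts must not be empty")
    n = len(counts)

    def percentile(p):
        # Nearest-rank percentile, using integer math to avoid float rounding.
        rank = (p * n + 99) // 100
        return counts[max(rank - 1, 0)]

    short = [c for c in counts if c <= short_threshold]
    total_tokens = sum(counts)
    return {
        "p50": percentile(50),
        "p90": percentile(90),
        # Fraction of queries answered within the threshold.
        "short_query_share": len(short) / n,
        # Fraction of all generated tokens spent on those queries.
        "short_token_share": sum(short) / total_tokens if total_tokens else 0.0,
    }


# Hypothetical token counts for ten logged responses.
profile = length_profile([10, 20, 30, 40, 500, 600, 50, 60, 700, 800])
```

A large gap between `short_query_share` and `short_token_share` means a minority of long responses dominates token spend. The next step would be to check whether those long responses are on questions that actually need the full chain.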

Source: Learning Adaptive Reasoning Paths for Efficient Visual Reasoning