Video Agents That Decide What to Watch Before Watching It

EVA lets video AI systems decide which frames matter before processing them, using reinforcement learning to develop adaptive viewing strategies instead of watching everything uniformly. The goal is to cut wasted computation on long videos, a critical bottleneck for any team building video understanding systems at scale.

Most video understanding systems sample frames uniformly and send the full token sequence to the model, regardless of the query. This works for short clips. On long videos, however, the token count explodes, temporal dependencies get buried in redundancy, and performance degrades precisely when it is most needed.

EVA (Efficient Reinforcement Learning for End-to-End Video Agent) reverses this process. Rather than perceiving first and then reasoning, it plans before perceiving. The agent executes an iterative summary-plan-action-reflection loop: it reads a lightweight summary, decides which frames to attend to, acts on that decision, and then updates its plan based on its findings. The model determines what it needs to watch before committing any compute to watching it.
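The control flow of that loop can be sketched in a few lines. This is a toy illustration, not EVA's implementation: the segment-caption summary format, the keyword-matching planner, the stopping rule, and every name below (`AgentState`, `plan`, `act`, `reflect`, `run_agent`) are assumptions made for the example. In EVA the planning and reflection are learned behaviors rather than hand-written rules.

```python
from dataclasses import dataclass, field

# (first frame, last frame inclusive, coarse caption); a hypothetical summary format.
Segment = tuple[int, int, str]


@dataclass
class AgentState:
    summary: list[Segment]
    seen: dict[int, str] = field(default_factory=dict)  # frame index -> observation


def plan(state, query, num_frames, budget):
    """Pick up to `budget` unseen frames, preferring segments whose caption matches the query."""
    words = set(query.lower().split())
    unseen = [i for i in range(num_frames) if i not in state.seen]
    hinted = [
        i for i in unseen
        if any(start <= i <= end and words & set(caption.lower().split())
               for start, end, caption in state.summary)
    ]
    return (hinted or unseen)[:budget]


def act(frames, indices):
    """Stand-in for the expensive step: only the chosen frames are 'perceived'."""
    return {i: frames[i] for i in indices}


def reflect(state, query):
    """Toy stopping rule: done once some observation mentions a query word."""
    words = set(query.lower().split())
    return any(words & set(obs.lower().split()) for obs in state.seen.values())


def run_agent(frames, summary, query, budget=2, max_iters=4):
    state = AgentState(summary=summary)
    for _ in range(max_iters):
        indices = plan(state, query, len(frames), budget)  # plan before perceiving
        if not indices:
            break
        state.seen.update(act(frames, indices))            # perceive only what was planned
        if reflect(state, query):                          # revise or stop
            break
    return state


# Frames are represented by short text labels so the example runs without a video.
frames = ["title", "title", "street", "dog sits", "dog runs", "park bench", "credits", "credits"]
summary = [(0, 2, "opening titles"), (3, 5, "dog at park"), (6, 7, "closing credits")]
state = run_agent(frames, summary, query="dog")
print(sorted(state.seen))  # prints [3, 4]
```

In this run the agent touches two of eight frames, because the summary points it at the right segment before any frame is processed.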

The training mechanism is crucial. EVA uses end-to-end RL (Reinforcement Learning) to learn this planning behavior directly from task reward, without a manually designed workflow or hand-crafted retrieval heuristics. Prior agent-based video methods attach external tools to a passive MLLM (Multimodal Large Language Model) backbone, which means their perception strategy is fixed at design time. EVA's RL training allows the agent to develop its own adaptive viewing strategy through trial and error. The reflection step completes the loop: after acting, the agent revises its plan before the next iteration, compressing multi-step temporal reasoning into a single end-to-end learned process.
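To make "learning a viewing strategy from task reward" concrete, here is a deliberately tiny sketch. It is not EVA's training procedure, which the source does not detail; plain REINFORCE on a one-step choice is swapped in as the simplest policy-gradient method. The setup is hypothetical: the policy picks one of several video segments to look at and receives reward 1 only when that segment contains the answer.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_segment_policy(num_segments, answer_segment, steps=300, lr=0.5, seed=0):
    """Learn which segment to inspect using only a 0/1 task reward (hypothetical setup)."""
    rng = random.Random(seed)
    logits = [0.0] * num_segments  # start from a uniform policy
    segments = list(range(num_segments))
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(segments, weights=probs)[0]
        reward = 1.0 if choice == answer_segment else 0.0
        # REINFORCE update: d/d logit_j of log pi(choice) is (1[j == choice] - probs[j]).
        for j in segments:
            indicator = 1.0 if j == choice else 0.0
            logits[j] += lr * reward * (indicator - probs[j])
    return softmax(logits)


probs = train_segment_policy(num_segments=4, answer_segment=2)
print(probs.index(max(probs)))
```

No rule ever tells the policy which segment is correct; the preference emerges from reward alone. That is the property the paragraph above attributes to EVA's training, here at a much smaller scale.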

The abstract does not present benchmark numbers, so the claimed efficiency gains over uniform sampling and tool-augmented baselines require verification in the full paper. The framing, planning-before-perception as an RL-learned behavior, is the structural contribution worth evaluating. If the numbers hold, this changes how one thinks about video agent architecture: the bottleneck on long videos lies not in model capacity, but in whether the model has agency over what it processes.

For teams developing video understanding pipelines, the practical implication is straightforward: if a system processes the full frame sequence regardless of query type, it incurs a compute tax that scales with video length. An adaptive selection layer, even a simple one, could plausibly recover much of that cost.
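A back-of-the-envelope calculation shows how that tax scales. All numbers below are illustrative assumptions (sampling rate, tokens per frame, summary size, number of selected frames), not figures from the EVA paper, and the function names are hypothetical.

```python
def uniform_tokens(duration_s, sample_fps, tokens_per_frame):
    """Visual tokens when every uniformly sampled frame is sent to the model."""
    num_frames = int(duration_s * sample_fps)
    return num_frames * tokens_per_frame


def adaptive_tokens(selected_frames, tokens_per_frame, summary_tokens):
    """Tokens when the model reads a summary first, then only the frames it selected."""
    return summary_tokens + selected_frames * tokens_per_frame


# Assumed inputs: a one-hour video sampled at 1 fps, 256 tokens per frame.
hour_uniform = uniform_tokens(duration_s=3600, sample_fps=1.0, tokens_per_frame=256)
# Assumed inputs: a 2,000-token summary plus 64 selected frames.
hour_adaptive = adaptive_tokens(selected_frames=64, tokens_per_frame=256, summary_tokens=2000)
print(hour_uniform, hour_adaptive)  # prints 921600 18384
```

The uniform cost grows linearly with duration, while the adaptive cost depends on how many frames the selector actually needs. The real saving therefore hinges on selection quality, which is what has to be measured.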

Key takeaways:

EVA uses end-to-end RL to learn planning-before-perception, allowing the agent to dynamically select which frames to process instead of uniformly consuming the full token sequence.
Passive MLLMs with external tools fix their perception strategy at design time, whereas RL training lets the strategy adapt to the query and content during inference.
Teams developing long-video understanding pipelines should evaluate whether their frame sampling strategy is query-adaptive; fixed uniform sampling is an immediate cost to reduce.

Source: EVA: Efficient Reinforcement Learning for End-to-End Video Agent