Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Video generation typically costs vastly more compute than understanding, which makes it expensive to bolt generation onto a language-model-first architecture. Uni-ViGU flips this: it starts from a video diffusion model as the foundation and adds understanding as a lightweight extension, so both tasks share the generator's rich visual knowledge without architectural strain. If the approach holds up, it could change how teams build unified systems for video captioning, QA, and generation.

Previous efforts to unify video understanding and generation have largely started from a language model and added generation on top. That approach scales poorly because of the compute asymmetry: video generation costs orders of magnitude more than understanding, so the most expensive capability ends up grafted onto a foundation that was not designed for it. Uni-ViGU inverts the starting point.

A video diffusion model becomes the base, and understanding is the add-on. The intuition is structural: a generator already learns rich spatiotemporal representations to produce coherent video; these same representations carry more perceptual signal than the compressed text-centric features most MLLMs (Multimodal Large Language Models) build from. Understanding tasks can ride on top without the architecture straining against its own design.

The mechanism has two components. First, a unified flow method combines continuous flow matching for video (the generative process that gradually transforms noise into frames) with discrete flow matching for text (the analogous process over text tokens) within a single forward pass. Both modalities are generated through the same process, without separate decoders. Second, a modality-driven MoE (Mixture of Experts) framework augments the transformer blocks (the transformer being the neural network architecture behind most modern AI models) with lightweight layers dedicated to text generation, keeping the video backbone intact while routing text outputs through specialized parameters. The result is a model that generates video and text coherently without inflating the core architecture.
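To make the two components concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the source does not specify the interpolation path, the discrete corruption scheme, or the expert layout. The sketch assumes a straight-line path from noise to data for the continuous side, a mask-based corruption for the discrete side (one common way to set up discrete flow matching), and a hard routing rule keyed on token modality. All names (MASK_ID, interpolate_continuous, corrupt_discrete, moe_block) are hypothetical.

```python
import random

MASK_ID = 0  # hypothetical id reserved for a [MASK] token (assumption)


def interpolate_continuous(x0, x1, t):
    """Point at time t on a straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for a linear path: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def corrupt_discrete(tokens, t, rng):
    """Discrete analogue: keep each token with probability t, else mask it.

    At t=0 every token is masked (pure "noise"); at t=1 the text is clean.
    """
    return [tok if rng.random() < t else MASK_ID for tok in tokens]


def moe_block(hidden, modalities, video_ffn, text_ffn):
    """Modality-driven routing: the expert is chosen by token type.

    Video tokens pass through the original backbone layer; text tokens pass
    through the added lightweight expert. There is no learned gate here.
    """
    return [
        text_ffn(h) if m == "text" else video_ffn(h)
        for h, m in zip(hidden, modalities)
    ]


def make_training_pair(video_latent, text_tokens, rng):
    """Build one jointly noised sample: both modalities share the same time t."""
    t = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in video_latent]
    noisy_video = interpolate_continuous(noise, video_latent, t)
    noisy_text = corrupt_discrete(text_tokens, t, rng)
    targets = (velocity_target(noise, video_latent), list(text_tokens))
    return t, noisy_video, noisy_text, targets
```

In this toy setup a single model would receive noisy_video and noisy_text together, regress the velocity for the video positions, and predict the original tokens at the masked text positions. Sharing one time value t across both modalities is an illustrative simplification, not a detail confirmed by the source.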

The main caveat is scope: the source abstract is truncated, so no specific benchmark numbers are available. What is structurally clear is the design bet: that a generation-first foundation gives the unified model stronger perceptual grounding than understanding-first approaches, and that MoE routing handles the modality asymmetry without blowing up the parameter count.

For teams building multimodal pipelines that need both generation and captioning or QA (Question Answering) over video, the generator-as-foundation framing is worth pressure-testing. Most current stacks treat video generation and video understanding as separate systems. If a single diffusion backbone can handle both without the compute penalty of adding generation to a language model, the infrastructure argument for unification gets significantly stronger.

Key takeaways:

Continuous and discrete flow matching run in a single process: video is generated by a noise-to-frame flow, text by a flow over tokens, and the two are unified in one forward pass, with MoE routing keeping text generation lightweight.
Starting from a generator, rather than a language model, works with the compute asymmetry instead of against it: the rich spatiotemporal priors are already in the backbone, and understanding tasks inherit them rather than fighting a text-first architecture.
Teams evaluating unified video-language systems should benchmark generator-first architectures against the standard MLLM-extended-to-generate approach before committing to either stack.

Source: Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator