Video MLLMs Fake Audio Understanding: Visual Hallucination at Scale

Every major video MLLM, including models from OpenAI and Google, substitutes visual inference for actual audio processing, a flaw now measurable and partially fixable.

Video models have been credited with audio-visual understanding. That credit is largely unearned. Across both leading open-source omni models and closed-source systems from Google and OpenAI, apparent audio comprehension turns out to be vision-driven inference: the model reads the image, predicts what sound should be there, and reports that prediction as if it had processed the audio stream.

This is not a subtle bias. It is a structural failure with a name: the audio-visual Clever Hans effect. The original Clever Hans was a horse that appeared to solve arithmetic problems but was actually reading involuntary cues from its handler. These models do the same thing with sound. They exploit the statistical correlation between visual content and expected audio, producing answers that look grounded in audio but are not verified against the actual signal. A video of a barking dog with the audio track swapped for silence, or replaced with a piano, will still elicit a confident description of barking, because the visual frame is doing all the work.

The framework built to expose this is called Thud. It applies three counterfactual audio edits to otherwise normal videos. Shift offsets audio and video so they no longer align in time, testing temporal synchronization. Mute removes the audio track entirely. Swap replaces the original audio with a semantically mismatched track. A model that genuinely processes audio should change its answers under all three interventions in predictable ways: flag the misalignment, report the absence of sound, and describe the replacement audio. A model that substitutes visual inference for audio verification will instead describe the expected sound under Mute (hallucinating it from the image) and will often miss the mismatch in Swap (reporting what the video looks like, not what it sounds like). Thud turns this behavioral signature into a measurable diagnostic.
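To make the three edits concrete, here is a minimal sketch of Shift, Mute, and Swap applied to a mono audio track held as a plain list of samples. The function names, the zero-filled stand-in for a removed track, and the choice to pad with silence are illustrative assumptions, not the Thud authors' implementation.

```python
from typing import List

Samples = List[float]  # mono audio as raw samples (an assumption for this sketch)


def mute(audio: Samples) -> Samples:
    """Mute: return silence of the same length (stand-in for removing the track)."""
    return [0.0] * len(audio)


def shift(audio: Samples, offset_s: float, sample_rate: int) -> Samples:
    """Shift: delay (offset_s > 0) or advance (offset_s < 0) the audio.

    The gap is padded with silence so the track keeps its original length
    and still lines up with the untouched video frames.
    """
    n = len(audio)
    k = int(round(offset_s * sample_rate))
    if k >= 0:
        k = min(k, n)
        return [0.0] * k + audio[: n - k]
    k = min(-k, n)
    return audio[k:] + [0.0] * k


def swap(audio: Samples, other: Samples) -> Samples:
    """Swap: replace the track with a mismatched one, trimmed or padded to length."""
    n = len(audio)
    return other[:n] + [0.0] * max(0, n - len(other))
```

In every case the video frames are left untouched, so any change in the model's answer can only come from the audio stream.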

The fix is a two-stage alignment recipe. First, contrastive preference pairs derived from the Thud interventions teach the model to distinguish verified audio from hallucinated audio, rewarding responses that reflect the actual acoustic content. Second, general event-level video preference data regularizes the model against over-specialization, preserving performance on standard video QA tasks that do not require audio-visual conflict resolution. The second stage matters: without it, training on intervention pairs alone can degrade the model's general video capabilities.
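One way to picture the training data is sketched below: each intervention yields a pair in which the preferred response reflects the actual audio and the rejected response reflects what the image suggests. The record layout, field names, and stage labels are hypothetical; the source does not specify a data format or a particular preference-optimization objective.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

INTERVENTIONS = {"shift", "mute", "swap"}


@dataclass(frozen=True)
class PreferencePair:
    video_id: str
    intervention: str  # "shift", "mute", "swap", or "none" for general data
    prompt: str
    chosen: str        # response the training should prefer
    rejected: str      # response the training should push away from


def make_intervention_pair(video_id: str, intervention: str, prompt: str,
                           audio_grounded: str, vision_inferred: str) -> PreferencePair:
    """Build a stage-one pair: prefer the answer that matches the edited audio."""
    if intervention not in INTERVENTIONS:
        raise ValueError(f"unknown intervention: {intervention}")
    if audio_grounded == vision_inferred:
        raise ValueError("pair carries no contrast between audio and vision")
    return PreferencePair(video_id, intervention, prompt,
                          chosen=audio_grounded, rejected=vision_inferred)


def two_stage_schedule(intervention_pairs: Sequence[PreferencePair],
                       general_pairs: Sequence[PreferencePair]
                       ) -> List[Tuple[str, List[PreferencePair]]]:
    """Order the data: intervention pairs first, general video pairs second."""
    return [("stage1_intervention", list(intervention_pairs)),
            ("stage2_general", list(general_pairs))]
```

For a muted clip of a barking dog, for example, the chosen response would say that no sound is audible and the rejected response would describe barking.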

A 10,000-sample training recipe using this approach lifts average performance across the three Thud intervention dimensions by 28 percentage points, while slightly improving scores on general video and audio-visual QA benchmarks. The gain is not incremental. For teams building products that claim audio-visual understanding, the takeaway is direct: any evaluation that does not include counterfactual audio edits is measuring vision, not audio-visual reasoning.

We're thinking: The Clever Hans framing is more precise than it first appears. The problem is not that these models are bad at audio. It is that they are very good at vision, good enough that audio verification becomes unnecessary for passing most benchmarks. Any product claiming audio-visual understanding today is, with high probability, shipping a model that behaves as vision-only under a misleading label. The Thud framework now makes that testable in an afternoon. The more uncomfortable implication is for evaluation infrastructure broadly: if visual-acoustic correlation is strong enough to fool SOTA models and existing benchmarks simultaneously, the benchmark corpus itself may be too visually predictable to surface this failure mode at all.

Key takeaways:

Apparent audio understanding in video MLLMs is a vision artifact: models exploit visual-acoustic correlation without verifying the audio stream, a failure mode Thud isolates via three counterfactual edits (Shift, Mute, Swap).
A 10K-sample two-stage alignment recipe recovers 28 percentage points across Thud's three intervention dimensions with no regression on general benchmarks; caveat: the improvement is measured on intervention-derived test sets, and real-world generalization to diverse audio-visual mismatch scenarios remains an open question.
Teams shipping any feature described as audio-visual understanding should run Thud-style counterfactual probes before release: mute the audio, swap it with a mismatched track, and check whether model outputs change accordingly.
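A pre-release probe along the lines of the last takeaway could look like the sketch below. The `model` callable, its `(frames, audio)` signature, and the two toy models are assumptions for illustration, and exact string comparison is a crude proxy: a changed answer is necessary for audio grounding but not sufficient, so flagged outputs still need to be checked against the edited audio.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical model interface: (video frames, audio samples) -> text answer.
Model = Callable[[Sequence[str], List[float]], str]


def probe_audio_sensitivity(model: Model, frames: Sequence[str],
                            audio: List[float],
                            mismatched_audio: List[float]) -> Dict[str, bool]:
    """Report whether the answer changes when the audio is muted or swapped."""
    baseline = model(frames, audio)
    muted = model(frames, [0.0] * len(audio))
    swapped = model(frames, list(mismatched_audio))
    return {"mute_changed": muted != baseline,
            "swap_changed": swapped != baseline}


def vision_only_model(frames: Sequence[str], audio: List[float]) -> str:
    """Toy model that ignores audio and guesses the sound from the first frame."""
    return f"I hear sounds matching: {frames[0]}"


def audio_aware_model(frames: Sequence[str], audio: List[float]) -> str:
    """Toy model whose answer depends on the audio samples it receives."""
    if not any(audio):
        return "no audible sound"
    return f"audio with peak {max(abs(s) for s in audio):.1f}"


if __name__ == "__main__":
    frames = ["dog barking"]
    for name, m in [("vision-only", vision_only_model),
                    ("audio-aware", audio_aware_model)]:
        print(name, probe_audio_sensitivity(m, frames, [0.2, -0.4], [0.9, 0.1]))
```

A model that returns the same answer under all three conditions is showing the Clever Hans signature described above.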

Source: When Vision Speaks for Sound