AI-generated videos are too consistent, and that's exactly how to catch them

AI-generated videos betray themselves through eerie consistency: their frames correlate with each other far too predictably because they're anchored to a fixed prompt, while real video naturally accumulates random camera shake and lighting flicker. A new detection method exploits this temporal fingerprint across entire videos rather than hunting for glitches in individual frames, which makes it more resilient to improvements in per-frame generation quality.

Real videos are messy. Camera shake, lighting flicker, and subtle motion randomness mean natural footage accumulates stochastic variation from frame to frame. AI-generated videos lack much of that variation because every frame is conditioned on the same prompt. This deterministic anchor leaves a structural fingerprint that existing detectors largely ignore.

Current AIGV (AI-Generated Video) detectors hunt for localized artifacts or short-term frame inconsistencies. That works until generation quality improves enough to suppress visible glitches. ATSS (Anomalous Temporal Self-Similarity) targets something deeper: the global temporal structure of how frames relate to each other across the full video. Real videos show irregular, noisy self-similarity patterns over time. Generated videos show unnaturally repetitive correlations, meaning semantic content and visual texture stay too stable, too consistently, for too long.

The mechanism is straightforward. Because a generated video is conditioned on a fixed prompt (text or image), the generative process produces frames that keep returning to that anchor. This creates anomalously high self-similarity scores across both visual and semantic domains, a pattern that persists even when individual frames look photorealistic. ATSS exploits this with a multimodal detection framework that computes temporal self-similarity matrices over visual features and semantic representations, then looks for the characteristic over-regularized structure that natural footage does not produce. The fingerprint is in how frames echo each other, not in any single frame.
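To make the matrix concrete, here is a minimal Python sketch. It is not the ATSS implementation. It assumes each frame has already been reduced to a feature vector by some encoder (visual or semantic), and it uses cosine similarity as the comparison, which is an assumption: the summary above does not say which similarity measure the method uses. The names `cosine` and `self_similarity_matrix` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def self_similarity_matrix(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the T x T matrix S with S[i][j] = cosine(frames[i], frames[j]).

    `frames` holds one feature vector per frame, in temporal order.
    (Assumed input format; the encoder that produces it is out of scope.)
    """
    return [[cosine(fi, fj) for fj in frames] for fi in frames]


# Toy input: three made-up 2-D "frame features", purely for illustration.
toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
S = self_similarity_matrix(toy_frames)
```

Each row of `S` describes how one frame relates to every other frame in the clip, so the whole matrix captures the global temporal structure the paragraph above describes. In a multimodal setup one such matrix would be computed per feature type, for example one from visual features and one from semantic representations.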

The limitation is real: as generators improve their ability to inject synthetic noise and variability, which is directly adversarial to this signal, the gap between generated and natural self-similarity distributions will narrow. Detection methods targeting artifacts have already cycled through this arms race. ATSS buys time by targeting generative logic rather than generative mistakes, but it's not immune to the same pressure.

For teams building content moderation pipelines or digital forensics infrastructure, the practical shift here is architectural: stop analyzing frames in isolation and start computing inter-frame similarity structure. A classifier that scores frames one at a time from their individual features misses the signal entirely. The signal lives in the temporal correlation matrix, not in any single frame embedding.
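As a rough illustration of what "structure" can mean here, the sketch below reads a similarity matrix along its diagonals instead of frame by frame. It assumes a precomputed T x T self-similarity matrix given as nested lists. The helpers `lag_profile` and `profile_spread` are hypothetical names, and the statistic is a toy stand-in chosen for clarity, not a feature taken from ATSS.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def lag_profile(S: Sequence[Sequence[float]]) -> List[float]:
    """Mean similarity between frames k steps apart, for k = 1 .. T-1."""
    T = len(S)
    if T < 2:
        raise ValueError("need at least two frames")
    return [fmean([S[i][i + k] for i in range(T - k)]) for k in range(1, T)]


def profile_spread(S: Sequence[Sequence[float]]) -> float:
    """Population standard deviation of the lag profile.

    A value near zero means similarity barely changes with temporal
    distance, i.e. frames stay alike no matter how far apart they are.
    """
    return pstdev(lag_profile(S))


def toy_matrix(T: int, drop_per_step: float) -> List[List[float]]:
    """Made-up matrix whose similarity falls linearly with frame distance."""
    return [[1.0 - drop_per_step * abs(i - j) for j in range(T)] for i in range(T)]


flat = toy_matrix(4, 0.0)       # similarity never decays: over-regular
decaying = toy_matrix(4, 0.2)   # similarity drops as frames move apart

flat_spread = profile_spread(flat)
decaying_spread = profile_spread(decaying)
```

Neither quantity can be computed from a single frame embedding, which is the architectural point: the input to the detector has to be the matrix (or something derived from it), so the pipeline must keep features for the whole clip in view. Any real threshold or learned classifier on top of such statistics would need to be fit on labeled data.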

Key takeaways:

AI-generated videos exhibit anomalously high temporal self-similarity across visual and semantic domains because prompt-anchored generation produces unnaturally consistent frame sequences, and this fingerprint survives even when individual frames are photorealistic.
Improvements in generation quality (sharper textures, cleaner motion) won't automatically defeat detection methods that target global temporal structure, unlike artifact-based detectors, which degrade as generation improves.
Teams building video authenticity detection should compute temporal self-similarity matrices across full video sequences rather than rely on per-frame classifiers alone, because the discriminative signal is structural, not local.

Source: ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity