SPLICE Benchmark: VLMs Score 51% on Temporal Reasoning

SPLICE Benchmark Results

The SPLICE benchmark tests temporal, causal, spatial, contextual, and commonsense reasoning by having models reconstruct the correct order of shuffled video clips. The research, co-authored by the original poster, was published at EMNLP 2025.
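To make the task concrete, here is a minimal Python sketch of how a clip-reordering item could be built and scored. It assumes exact-match scoring (a prediction counts only if the whole order is correct); the summary does not state which metric SPLICE uses, and all function names here are hypothetical, not taken from the benchmark's code.

```python
import math
import random


def shuffle_clips(clip_ids, rng):
    """Return a shuffled copy of the clip IDs; the input list is left untouched."""
    shuffled = list(clip_ids)
    rng.shuffle(shuffled)
    return shuffled


def exact_match_accuracy(predictions, references):
    """Fraction of items whose predicted order equals the reference order exactly.

    Assumption: exact-match scoring. The source does not name the metric.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(list(p) == list(r) for p, r in zip(predictions, references))
    return hits / len(references)


def random_guess_exact_match(n_clips):
    """Chance that a uniformly random ordering of n_clips is exactly right: 1/n!."""
    return 1 / math.factorial(n_clips)


if __name__ == "__main__":
    rng = random.Random(0)
    correct = [0, 1, 2, 3]
    print("shuffled input:", shuffle_clips(correct, rng))
    print("accuracy:", exact_match_accuracy([[0, 1, 2, 3], [1, 0, 2, 3]], [correct, correct]))
    print("random baseline for 4 clips:", random_guess_exact_match(4))
```

Under this assumed metric the random baseline shrinks factorially with the number of clips, which is why "barely above random guessing" depends on how many clips each video is split into.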

Model Performance Details

Tested models included Gemini Flash (1.5 and 2.0), Qwen2-VL (7B and 72B), InternVL2.5, and LLaVA-OneVision. Gemini 2.0 Flash scored 51% on the vision-only task, while human performance was 85%. Open-source models struggled significantly:

LLaVA-OneVision-72B scored barely above random guessing in the vision-only setting
InternVL2.5-78B performed similarly poorly
Qwen2-VL-72B reached only around 30% in the vision-only setting
Qwen2-VL-7B performed on par with the 72B variant, suggesting scaling the language model doesn't help when the bottleneck is in the vision encoder

Language Prior Dependency

When human-written text annotations describing clip content were added, model performance jumped significantly while human performance remained unchanged. This indicates models rely on language priors to compensate for weak visual understanding. Notably, Qwen2-VL-72B outperformed Gemini on text-only reasoning.

Visual Shortcut Behavior

Models also took visual shortcuts. When the first and last clips of a video looked similar (for example, opening and closing a printer door), models predicted those clips were adjacent 57% of the time, compared to 2.5% for humans and 27% under random chance. This suggests models are matching on visual similarity rather than reasoning about events.
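The adjacency statistic above can be sketched as follows. This is an illustrative Python example, not the authors' analysis code: it assumes each prediction is a list of true clip positions (0 for the first clip, n-1 for the last), and the function names are hypothetical. For a fixed number of clips n, the chance that a uniformly random ordering places two specific clips next to each other is 2/n, which the sketch confirms by brute-force enumeration. The clip counts behind the reported 27% are not given in this summary.

```python
from fractions import Fraction
from itertools import permutations


def first_last_adjacent(predicted):
    """True if the true first clip (0) and true last clip (n-1) are neighbours
    in the predicted order. Assumes `predicted` is a permutation of 0..n-1."""
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two clips")
    return abs(predicted.index(0) - predicted.index(n - 1)) == 1


def adjacency_rate(predicted_orders):
    """Share of predictions that place the first and last clips side by side."""
    if not predicted_orders:
        return 0.0
    hits = sum(first_last_adjacent(order) for order in predicted_orders)
    return hits / len(predicted_orders)


def random_adjacency_chance(n_clips):
    """Exact chance of first/last adjacency under a uniformly random ordering,
    computed by enumerating all n! permutations (fine for small n)."""
    orders = list(permutations(range(n_clips)))
    hits = sum(first_last_adjacent(order) for order in orders)
    return Fraction(hits, len(orders))


if __name__ == "__main__":
    for n in range(3, 8):
        # Enumeration agrees with the closed form 2/n.
        assert random_adjacency_chance(n) == Fraction(2, n)
        print(n, "clips:", float(random_adjacency_chance(n)))
```

Comparing a model's `adjacency_rate` on visually similar first/last pairs against this baseline is one way to quantify how far its predictions drift from chance toward the shortcut.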

Testing Limitations and Future Work

The research didn't test Claude (which doesn't support video input) or OpenAI models (which couldn't handle multi-video input reliably at testing time). The dataset is public. The poster notes that newer models such as Gemini 3 Flash and Qwen3-VL (with native 256K interleaved context, enhanced spatial-temporal modeling, and MoE variants up to 235B) should be tested on SPLICE to see whether the language prior issue persists. Preliminary testing suggests it does, though statistical significance hasn't been established across all experimental samples.

📖 Read the full source: r/LocalLLaMA