SPLICE Benchmark Reveals VLMs Struggle with Temporal Reasoning, Rely on Language Priors

SPLICE Benchmark Results
The SPLICE benchmark tests temporal, causal, spatial, contextual, and common sense reasoning by having models reconstruct the correct sequence of shuffled video clips. The research, co-authored by the source poster, was published at EMNLP 2025.
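The task setup described above can be sketched in a few lines. This is only an illustration, not the authors' evaluation code: the summary does not state how SPLICE scores an ordering, so exact-match scoring is an assumption here, and the function names `shuffle_clips`, `is_exact_match`, and `accuracy` are hypothetical.

```python
import random


def shuffle_clips(num_clips, seed=0):
    """Return a shuffled presentation order as a list of original clip indices."""
    rng = random.Random(seed)
    order = list(range(num_clips))
    rng.shuffle(order)
    return order


def is_exact_match(predicted_order, true_order):
    """Assumed metric: a prediction counts only if every clip is in the right position."""
    return list(predicted_order) == list(true_order)


def accuracy(predictions, references):
    """Fraction of videos whose predicted clip order matches the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    correct = sum(is_exact_match(p, r) for p, r in zip(predictions, references))
    return correct / len(predictions)
```

Under this assumed metric, a model that orders one of two videos correctly scores 0.5, regardless of how close the other prediction was. A partial-credit metric would reward near-misses, which is why the choice of metric matters when comparing numbers across benchmarks.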
Model Performance Details
Tested models included Gemini Flash (1.5 and 2.0), Qwen2-VL (7B and 72B), InternVL2.5, and LLaVA-OneVision. Gemini 2.0 Flash scored 51% on the vision-only task, while human performance was 85%. Open-source models struggled significantly:
- LLaVA-OneVision-72B scored barely above random guessing in the vision-only setting
- InternVL2.5-78B performed similarly poorly
- Qwen2-VL-72B reached only around 30% in the vision-only setting
- Qwen2-VL-7B performed on par with the 72B variant, suggesting that scaling the language model doesn't help when the bottleneck is in the vision encoder
Language Prior Dependency
When human-written text annotations describing clip content were added, model performance jumped significantly while human performance remained unchanged. This indicates models rely on language priors to compensate for weak visual understanding. Notably, Qwen2-VL-72B outperformed Gemini on text-only reasoning.
Visual Shortcut Behavior
Models also showed shortcut behavior. When the first and last video clips looked visually similar (for example, opening and closing a printer door), models predicted that those clips were adjacent 57% of the time, compared with 2.5% for humans and 27% expected by random chance. This suggests models are pattern matching on visual similarity rather than reasoning about events.
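The adjacency check behind these numbers can be sketched as follows. This is a minimal illustration with hypothetical function names, not the study's analysis code. For a video with n clips, a uniformly random ordering places two given clips next to each other with probability 2/n, and the enumeration below confirms this for small n. The summary does not give clip counts per video, so the sketch does not try to reproduce the reported 27% chance rate.

```python
from itertools import permutations


def first_last_adjacent(predicted_order, first_clip, last_clip):
    """True if the prediction places the true first and last clips side by side."""
    i = predicted_order.index(first_clip)
    j = predicted_order.index(last_clip)
    return abs(i - j) == 1


def chance_adjacency_rate(num_clips):
    """Fraction of all orderings in which clips 0 and num_clips - 1 are adjacent.

    Brute-force enumeration; the closed form is 2 / num_clips.
    Only practical for small num_clips, since it visits num_clips! orderings.
    """
    total = 0
    hits = 0
    for order in permutations(range(num_clips)):
        total += 1
        hits += first_last_adjacent(order, 0, num_clips - 1)
    return hits / total
```

For example, `chance_adjacency_rate(5)` comes out at 0.4, matching 2/5. Comparing a model's observed adjacency rate against this baseline is what separates a genuine visual-similarity shortcut from adjacency that would happen by chance.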
Testing Limitations and Future Work
The study did not test Claude (which doesn't support video input) or OpenAI models (which couldn't handle multi-video input reliably at testing time). The dataset is public. The poster notes that newer models such as Gemini 3 Flash and Qwen3-VL (with native 256K interleaved context, enhanced spatial-temporal modeling, and MoE variants up to 235B) should be tested on SPLICE to see whether the reliance on language priors persists. Preliminary testing suggests it does, though statistical significance hasn't been established across all experimental samples.
📖 Read the full source: r/LocalLLaMA