TL;DR
SYNCR is a synthetic, multi-video reasoning benchmark designed to evaluate and diagnose the reasoning abilities of multimodal large language models across various tasks, revealing significant gaps compared to human performance.
Contribution
The paper introduces SYNCR, a synthetic multi-video reasoning benchmark with programmatically verified grounding, enabling precise evaluation of models' reasoning capabilities across eight diagnostic tasks.
Findings
Current models achieve only 52.5% accuracy, far below the human baseline of 89.5%.
Models excel at temporal ordering but struggle with physical and spatial reasoning.
Parameter scaling and specialized training improve temporal alignment but not fine-grained physical tracking.
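To make the findings above concrete, here is a minimal sketch of how per-pillar accuracy and the model-to-human gap might be computed. The helper `accuracy_by_pillar` and the toy records are hypothetical and exist only for illustration; they are not the paper's evaluation code or data. Only the 52.5% and 89.5% figures come from the findings.

```python
from collections import defaultdict


def accuracy_by_pillar(records):
    """Return {pillar: accuracy in percent} from (pillar, is_correct) pairs.

    Hypothetical helper: the record format is an assumption, not SYNCR's schema.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for pillar, is_correct in records:
        totals[pillar] += 1
        correct[pillar] += int(is_correct)
    return {p: 100.0 * correct[p] / totals[p] for p in totals}


# Toy records invented for illustration only (not SYNCR results).
toy_records = [
    ("Temporal Alignment", True),
    ("Temporal Alignment", True),
    ("Spatial Tracking", True),
    ("Spatial Tracking", False),
]
per_pillar = accuracy_by_pillar(toy_records)

# Headline gap, using the two accuracy figures reported in the findings.
model_accuracy = 52.5
human_accuracy = 89.5
gap_points = human_accuracy - model_accuracy  # 37.0 percentage points
```

Breaking accuracy down by pillar rather than reporting a single number is what lets a benchmark like this separate strengths (temporal ordering) from weaknesses (physical and spatial reasoning).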
Abstract
Multimodal Large Language Models (MLLMs) have made rapid progress in single-video understanding, yet their ability to reason across multiple independent video streams remains poorly understood. Existing multi-video benchmarks rely largely on human-annotated real-world footage, limiting the precision of spatial, temporal, and physical ground truth and making it difficult to diagnose model failures. We introduce SYNCR, a controlled synthetic benchmark for cross-video reasoning with programmatically verified grounding. Built using Habitat, Kubric, and CLEVRER simulator engines, SYNCR contains 8,163 multi-video question-answer pairs grounded in 9,650 unique videos. It evaluates MLLMs across eight tasks spanning four diagnostic pillars: Temporal Alignment, Spatial Tracking, Comparative Reasoning, and Holistic Synthesis. Our zero-shot evaluation of leading open- and closed-weight MLLMs…