A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding
Yue Zhang, Liqiang Jing, Jia Li, Yapeng Tian, Xinya Du, Yunhui Guo, Vibhav Gogate

TL;DR
This paper introduces MVX-Bench, a multi-video understanding benchmark of 1,442 questions over 4,255 videos that recasts 11 classical computer vision tasks as multi-video question answering, and SAMA, a skill-augmented agentic framework for reasoning across videos that outperforms baseline models on the benchmark.
Contribution
The paper presents a new unified benchmark for multi-video reasoning and a novel agentic framework that incorporates skills and conflict resolution for improved multi-video understanding.
Findings
SAMA outperforms baseline models on MVX-Bench.
Skill design and conflict resolution improve reasoning accuracy.
The benchmark covers diverse real-world multi-video tasks.
Abstract
Multimodal Large Language Models have achieved strong performance in single-video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which introduces training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination. Meanwhile, current multi-video benchmarks primarily emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. To address these gaps, we introduce MVX-Bench, a Multi-Video Cross-Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi-video question-answering framework, comprising 1,442 questions over 4,255 videos from diverse real-world datasets. We further…
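The frame-compression concern raised in the abstract can be illustrated with a small arithmetic sketch. It assumes a model with a fixed total frame budget that is split evenly across concatenated videos and sampled uniformly. The function names, the even split, and the budget of 64 frames are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: the even split and uniform sampling are assumptions,
# not the procedure used by any specific model or by the paper.

def frames_per_video(total_frame_budget: int, num_videos: int) -> int:
    """Frames each video receives when a fixed budget is split evenly."""
    if num_videos <= 0:
        raise ValueError("num_videos must be positive")
    return total_frame_budget // num_videos


def sample_indices(num_frames: int, num_samples: int) -> list[int]:
    """Uniformly spaced frame indices (segment midpoints) for one video."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)  # cannot sample more than exist
    step = num_frames / num_samples
    return [int(i * step + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    budget = 64  # hypothetical frame budget for a single input
    for n in (1, 2, 4, 8):
        k = frames_per_video(budget, n)
        print(f"{n} video(s): {k} frames each")
```

Under these assumptions, each additional video shrinks the per-video share of the budget, so a 300-frame clip that would be sampled 64 times on its own is sampled only 8 times when eight videos are concatenated.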
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
