A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding

Yue Zhang; Liqiang Jing; Jia Li; Yapeng Tian; Xinya Du; Yunhui Guo; Vibhav Gogate

arXiv:2603.14733 · cs.CV · March 17, 2026

Open Access

TL;DR

This paper introduces MVX-Bench, a multi-video understanding benchmark that recasts 11 classical computer vision tasks as multi-video question answering, and SAMA, a skill-augmented agentic framework for reasoning across videos that outperforms baseline models on MVX-Bench.

Contribution

The paper presents a new unified benchmark for multi-video reasoning and a novel agentic framework that incorporates skills and conflict resolution for improved multi-video understanding.

Findings

01. SAMA outperforms baseline models on MVX-Bench.

02. Skill design and conflict resolution improve reasoning accuracy.

03. Benchmark covers diverse real-world multi-video tasks.

Abstract

Multimodal Large Language Models have achieved strong performance in single-video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which introduces training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination. Meanwhile, current multi-video benchmarks primarily emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. To address these gaps, we introduce MVX-Bench, a Multi-Video Cross-Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi-video question-answering framework, comprising 1,442 questions over 4,255 videos from diverse real-world datasets. We further…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis