A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal

TL;DR
This paper introduces LongShOTBench, a comprehensive benchmark for evaluating multimodal reasoning and tool use in long videos, and presents LongShOTAgent, an agentic system that improves understanding of complex video content.
Contribution
It provides a new diagnostic benchmark with open-ended questions and a scalable, human-verified pipeline, along with an agentic system for long video analysis, addressing gaps in existing evaluation methods.
Findings
State-of-the-art models perform significantly below human level on LongShOTBench.
Open-source models achieve below 30% accuracy, underscoring the benchmark's difficulty.
LongShOTAgent achieves 44.66% accuracy, demonstrating the potential of agentic approaches.
Abstract
Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both. While some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and graded rubric for interpretable and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent,…
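To illustrate why a graded rubric is more traceable than a single accuracy score, here is a minimal sketch in Python. It is not the paper's scoring procedure: the `Criterion` class, the weights, the weighted-fraction formula, and the example criteria are all hypothetical, chosen only to show how a rubric can report which requirements an answer missed.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str  # what the answer is expected to contain (hypothetical)
    weight: float     # assumed importance weight, not taken from the paper
    satisfied: bool   # judgment supplied by a human or model grader


def rubric_score(criteria):
    """Return (score in [0, 1], descriptions of unmet criteria)."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(c.weight for c in criteria if c.satisfied)
    missed = [c.description for c in criteria if not c.satisfied]
    return earned / total, missed


# Hypothetical rubric for one open-ended question about a long video.
example = [
    Criterion("identifies the speaker", 2.0, True),
    Criterion("cites the relevant audio cue", 1.0, False),
    Criterion("places the event in the correct segment", 1.0, True),
]
score, missed = rubric_score(example)
print(score, missed)
```

Under these assumptions, the list of unmet criteria is what makes a failure traceable: two answers with the same score can miss different requirements.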
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Social Robot Interaction and HRI
