LongVideoAgent: Multi-Agent Reasoning with Long Videos
Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen

TL;DR
This paper introduces LongVideoAgent, a multi-agent system for reasoning over hour-long videos. It localizes question-relevant segments, extracts targeted visual details, and outperforms non-agent baselines on the new episode-level LongTVQA and LongTVQA+ datasets.
Contribution
The paper presents a multi-agent framework for long-video question answering in which the master agent is trained with reinforcement learning, addressing limitations of prior methods that rely on lossy summaries or limited toolsets.
Findings
Outperforms non-agent baselines on LongTVQA and LongTVQA+ datasets.
Reinforcement learning improves reasoning and planning capabilities.
Provides interpretable reasoning trajectories.
Abstract
Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our…
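The coordination pattern the abstract describes (a master that plans under a step limit, calls a grounding agent to localize a clip, and calls a vision agent to turn that clip into a textual observation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the action vocabulary ("ground", "look", "answer"), the scripted toy policy, and the stub agents are all hypothetical stand-ins for the LLM and vision models, and the reinforcement-learning training step is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Clip = Tuple[float, float]   # hypothetical (start_sec, end_sec) of a localized segment
Step = Tuple[str, str]       # (action name, textual observation)


@dataclass
class Trajectory:
    """Record of every action and observation, kept for interpretability."""
    steps: List[Step] = field(default_factory=list)


def run_master(
    question: str,
    policy: Callable[[str, List[Step]], Tuple[str, str]],
    ground: Callable[[str], Clip],
    look: Callable[[Clip, str], str],
    max_steps: int = 5,
) -> Tuple[Optional[str], Trajectory]:
    """Let the master policy call the two agents until it answers or runs out of steps."""
    traj = Trajectory()
    clip: Optional[Clip] = None
    for _ in range(max_steps):
        action, arg = policy(question, traj.steps)
        if action == "answer":
            traj.steps.append(("answer", arg))
            return arg, traj
        if action == "ground":
            # Grounding agent: localize the segment relevant to the question.
            clip = ground(question)
            obs = f"clip={clip}"
        elif action == "look":
            # Vision agent: describe the localized clip as text.
            obs = look(clip, arg) if clip is not None else "no clip localized yet"
        else:
            obs = f"unknown action: {action}"
        traj.steps.append((action, obs))
    return None, traj  # step limit reached without an answer


# Toy stand-ins (assumptions for illustration only; real agents would be models).
def toy_policy(question: str, steps: List[Step]) -> Tuple[str, str]:
    if not steps:
        return ("ground", "")
    if len(steps) == 1:
        return ("look", question)
    return ("answer", steps[-1][1])


def toy_ground(question: str) -> Clip:
    return (120.0, 150.0)


def toy_look(clip: Clip, query: str) -> str:
    return "a red mug on the table"


answer, traj = run_master("What is on the table?", toy_policy, toy_ground, toy_look)
print(answer)
for action, obs in traj.steps:
    print(action, "->", obs)
```

In this sketch the recorded trajectory plays the role of the interpretable reasoning trace, and `max_steps` plays the role of the step limit. A learned policy would replace `toy_policy`.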
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Video Analysis and Summarization
