LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs
Yunxin Li, Xinyu Chen, Baotian Hu, Min Zhang

TL;DR
This paper introduces an Interactive Visual Adapter (IVA) for large language models (LLMs) that improves long video question answering by enabling fine-grained interaction with visual features, reducing computational costs, and strengthening comprehension of lengthy videos.
Contribution
The paper proposes a novel IVA module, integrated into LLMs and designed specifically for long video understanding, that consists of a lightweight temporal frame selector and a spatial feature interactor.
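The two IVA components named above can be illustrated with a toy sketch. The text here does not specify how either component is computed, so this example assumes dot-product relevance scoring with top-k selection for the frame selector, and single-head scaled dot-product cross-attention for the interactor. The function names `select_frames` and `interact`, the query vector, and all data are hypothetical; this is not the authors' implementation.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_frames(frame_feats: List[Vector], query: Vector, k: int) -> List[int]:
    """Hypothetical temporal frame selector: return the indices of the k frames
    whose global features score highest against the query, in temporal order."""
    scores = [dot(f, query) for f in frame_feats]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # restore chronological order of the kept frames


def interact(query: Vector, patch_feats: List[Vector]) -> Vector:
    """Hypothetical spatial feature interactor: single-head scaled dot-product
    cross-attention from the query to the patch features of the kept frames."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, p) * scale for p in patch_feats])
    dim = len(patch_feats[0])
    return [sum(w * p[d] for w, p in zip(weights, patch_feats)) for d in range(dim)]


# Toy data (hypothetical): one global feature per frame, plus patch features per frame.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.2, 0.8]]
patches: Dict[int, List[Vector]] = {
    0: [[1.0, 0.0], [0.5, 0.5]],
    1: [[0.0, 1.0]],
    2: [[0.0, 1.0]],
    3: [[0.3, 0.7]],
}
query = [1.0, 0.0]  # stands in for an LLM hidden state conditioned on the question

kept = select_frames(frames, query, k=2)
fused = interact(query, [p for i in kept for p in patches[i]])
print(kept, fused)
```

With this toy data the selector keeps frames 0 and 2, and the fused vector is an attention-weighted average of those frames' patch features. Re-sorting the kept indices after top-k selection is one plausible design choice for long videos, since it preserves the temporal sequence of the retained frames.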
Findings
Significant performance improvements on nine video understanding benchmarks.
Effective handling of long videos with enhanced visual interaction.
Ablation studies confirm IVA's contribution to understanding both long and short videos.
Abstract
Long video understanding is a significant and ongoing challenge at the intersection of multimedia and artificial intelligence. Employing large language models (LLMs) to comprehend video has become an emerging and promising approach. However, this approach incurs high computational costs due to the large number of video tokens, suffers from reduced visual clarity as a consequence of token aggregation, and struggles with irrelevant visual tokens when answering video-related questions. To alleviate these issues, we present an Interactive Visual Adapter (IVA) within LLMs, designed to enhance interaction with fine-grained visual elements. Specifically, we first transform long videos into temporal video tokens by leveraging a visual encoder alongside a pretrained causal transformer, then feed them into LLMs together with the video instructions. Subsequently, we integrated IVA,…
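The first stage described in the abstract, turning per-frame features into temporal video tokens with a causal transformer, can be sketched as follows. This is a minimal stand-in under stated assumptions: a single untrained causal self-attention step with identity projections replaces the paper's pretrained causal transformer, the frame features are toy 2-d vectors rather than visual-encoder outputs, and the function name `causal_temporal_tokens` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def causal_temporal_tokens(frame_feats: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for the pretrained causal transformer: one head of
    causal self-attention with identity projections, so token t mixes only
    frames 0..t."""
    dim = len(frame_feats[0])
    scale = 1.0 / math.sqrt(dim)
    tokens = []
    for t, q in enumerate(frame_feats):
        past = frame_feats[: t + 1]  # causal mask: no access to future frames
        scores = [sum(a * b for a, b in zip(q, k)) * scale for k in past]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        # Attention-weighted average of the visible (past and current) frames.
        tokens.append([sum(e / z * k[d] for e, k in zip(exps, past)) for d in range(dim)])
    return tokens


# Toy per-frame features (hypothetical); a real pipeline would use visual-encoder outputs.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens = causal_temporal_tokens(frames)
print(tokens)
```

Because of the causal mask, the token for frame t depends only on frames 0 through t, so changing a later frame leaves earlier tokens unchanged. Feeding these tokens into the LLM together with the instruction tokens, as the abstract describes, is not shown.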
Taxonomy
Topics: Online Learning and Analytics
Methods: Adapter
