An Empirical Study on How Video-LLMs Answer Video Questions

Chenhui Gou; Ziyu Ma; Zicheng Duan; Haoyu He; Feng Chen; Akide Liu; Bohan Zhuang; Jianfei Cai; Hamid Rezatofighi

arXiv:2508.15360·cs.CV·August 22, 2025

Open Access

TL;DR

This paper systematically investigates how Video-LLMs internally process video content for question answering, revealing the roles of different layers and attention mechanisms, and proposes ways to improve efficiency based on these insights.

Contribution

It introduces a novel empirical analysis using attention knockouts to interpret Video-LLMs, uncovering layer-specific functions and guiding efficiency improvements.

Findings

01. Video information extraction mainly occurs in early layers.

02. In the fine-grained analysis, some intermediate layers stand out as critical outliers.

03. Spatial-temporal modeling relies more on language-guided retrieval than on intra-video attention.

Abstract

Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with limited attention to understanding their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. To interpret existing Video-LLMs, we adopt attention knockouts as our primary analytical tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. We then apply these three knockouts to windows of layers of varying size. By carefully controlling the window of layers and the type of knockout, we provide two settings: a global setting and a fine-grained setting. Our study reveals three key findings: (1) Global setting indicates Video information extraction…
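To make the idea of an attention knockout concrete, here is a minimal Python sketch of how the three variants named in the abstract could be expressed as attention masks. It is not the authors' implementation. The abstract does not define the variants, so the reading used here is an assumption: temporal knockout cuts attention between video tokens of different frames, spatial knockout cuts attention between distinct video tokens of the same frame, and language-to-video knockout cuts attention from language tokens to video tokens. The names `knockout_mask`, `layers_in_window`, and the `frame_ids` token layout are hypothetical.

```python
from typing import List, Optional

KINDS = ("temporal", "spatial", "language_to_video")


def causal_mask(n: int) -> List[List[bool]]:
    """allowed[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(n)] for q in range(n)]


def knockout_mask(frame_ids: List[Optional[int]], kind: str) -> List[List[bool]]:
    """Build a causal attention mask with one family of edges removed.

    frame_ids[i] is the frame index of token i if it is a video token,
    or None if it is a language token (hypothetical layout).
    The meaning of each `kind` is an assumed reading of the variant names.
    """
    if kind not in KINDS:
        raise ValueError(f"unknown knockout kind: {kind!r}")
    n = len(frame_ids)
    allowed = causal_mask(n)
    for q in range(n):
        for k in range(q + 1):  # only causally visible edges can be cut
            fq, fk = frame_ids[q], frame_ids[k]
            q_video, k_video = fq is not None, fk is not None
            if kind == "temporal":
                # video token attending to a video token in another frame
                cut = q_video and k_video and fq != fk
            elif kind == "spatial":
                # video token attending to a different token in its own frame
                cut = q_video and k_video and fq == fk and q != k
            else:
                # language token attending to any video token
                cut = (not q_video) and k_video
            if cut:
                allowed[q][k] = False
    return allowed


def layers_in_window(start: int, size: int, num_layers: int) -> List[int]:
    """Indices of the layers a knockout is applied to, clipped to the model depth."""
    return list(range(start, min(start + size, num_layers)))


# Two frames of two tokens each, followed by two question tokens.
frame_ids = [0, 0, 1, 1, None, None]
temporal = knockout_mask(frame_ids, "temporal")
spatial = knockout_mask(frame_ids, "spatial")
lang2vid = knockout_mask(frame_ids, "language_to_video")
window = layers_in_window(start=4, size=3, num_layers=32)
```

In an experiment of this kind, the mask would replace the usual causal mask only in the layers returned by `layers_in_window`, and the change in question-answering accuracy would then be compared against the unmodified model. How the window sizes map onto the paper's global and fine-grained settings is not stated in the text above, so the sketch leaves that choice to the caller.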

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning