VideoQA in the Era of LLMs: An Empirical Study

Junbin Xiao; Nanxin Huang; Hangyu Qin; Dongyang Li; Yicong Li; Fengbin Zhu; Zhulin Tao; Jianxing Yu; Liang Lin; Tat-Seng Chua; Angela Yao

arXiv:2408.04223 · cs.CV · June 17, 2025

TL;DR

This paper provides an empirical analysis of Video Large Language Models' performance in Video Question Answering, highlighting their strengths in content correlation and their weaknesses in temporal reasoning, robustness, and interpretability.

Contribution

It offers a comprehensive study of Video-LLMs' behavior in VideoQA, revealing their capabilities and limitations, and emphasizes the need for improved robustness and explainability.

Findings

01

Video-LLMs excel at correlating contextual cues and generating plausible answers.

02

Models struggle with video temporality, both in reasoning about the temporal order of content and in grounding QA-relevant temporal moments.

03

Models are unresponsive to adversarial video perturbations yet sensitive to simple variations.
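One way to make the temporal and robustness findings concrete is to measure how often a model's answer changes when the video is perturbed. The sketch below is illustrative only and is not the paper's evaluation protocol: the `Model` interface, the `reverse_frames` perturbation, and both toy models are hypothetical stand-ins, and frames are represented as plain integers rather than decoded images.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a model maps (frames, question) to an answer string.
Frames = Sequence[int]  # stand-in for decoded frames; here just frame indices
Model = Callable[[Frames, str], str]


def reverse_frames(frames: Frames) -> List[int]:
    """Assumed temporal perturbation: play the video backwards."""
    return list(reversed(frames))


def flip_rate(
    model: Model,
    samples: Sequence[Tuple[Frames, str]],
    perturb: Callable[[Frames], Frames],
) -> float:
    """Fraction of samples whose answer changes after perturbing the video."""
    if not samples:
        return 0.0
    flips = sum(model(f, q) != model(perturb(f), q) for f, q in samples)
    return flips / len(samples)


def order_blind_model(frames: Frames, question: str) -> str:
    # Toy model that only looks at which frames are present, not their order.
    return "yes" if sum(frames) % 2 == 0 else "no"


def order_aware_model(frames: Frames, question: str) -> str:
    # Toy model whose answer depends on the frames being in ascending order.
    return "yes" if list(frames) == sorted(frames) else "no"


samples = [
    (list(range(8)), "Did the person sit down before drinking?"),
    (list(range(3, 11)), "What happened after the door opened?"),
]

print(flip_rate(order_blind_model, samples, reverse_frames))  # 0.0
print(flip_rate(order_aware_model, samples, reverse_frames))  # 1.0
```

A flip rate near zero under a perturbation that destroys temporal order would suggest the model is not using that order, which is the kind of behavior the findings above describe.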

Abstract

Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays a pivotal role in Video-LLM development. This work conducts a timely and comprehensive study of Video-LLMs' behavior in VideoQA, aiming to elucidate their success and failure modes and to provide insights towards more human-like video understanding and question answering. Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video contents. However, models falter in handling video temporality, both in reasoning about temporal content ordering and in grounding QA-relevant temporal moments. Moreover, the models behave unintuitively: they are unresponsive to adversarial video perturbations while being sensitive to simple variations of…

Code & Models

Repositories

doc-doc/videoqa-llms (Official)

Taxonomy

Topics: Digital Rights Management and Security