LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering

Xinxin Dong; Baoyun Peng; Haokai Ma; Yufei Wang; Zixuan Dong; Fei Hu; Xiaodong Wang

arXiv:2507.14784·cs.CV·August 19, 2025

TL;DR

LeAdQA is a framework that combines causal-aware query refinement with fine-grained visual grounding, using LLMs and adaptive fusion to retrieve relevant segments precisely and improve complex reasoning in VideoQA.

Contribution

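This paper presents LeAdQA, an approach to VideoQA that integrates causal-aware query reformulation with targeted visual grounding. It addresses two limitations of previous methods: task-agnostic frame sampling, which buries key events in irrelevant content, and heuristic retrieval, which misses the causal-temporal structure needed for complex reasoning.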

Findings

01. Achieves state-of-the-art performance on NExT-QA, IntentQA, and NExT-GQA datasets.
02. Demonstrates improved understanding of causal-temporal structures in videos.
03. Enhances reasoning accuracy while maintaining computational efficiency.

Abstract

Video Question Answering (VideoQA) requires identifying sparse critical moments in long videos and reasoning about their causal relationships to answer semantically complex questions. While recent advances in multimodal learning have improved alignment and fusion, current approaches remain limited by two prevalent but fundamentally flawed strategies: (1) task-agnostic sampling indiscriminately processes all frames, overwhelming key events with irrelevant content; and (2) heuristic retrieval captures superficial patterns but misses causal-temporal structures needed for complex reasoning. To address these challenges, we introduce LeAdQA, an innovative approach that bridges these gaps through synergizing causal-aware query refinement with fine-grained visual grounding. Our method first leverages LLMs to reformulate question-option pairs, resolving causal ambiguities and sharpening temporal…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques