VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao

TL;DR
VideoDistill introduces a goal-driven, language-aware framework for VideoQA that improves both vision perception and answer generation by focusing on question-related visual cues, achieving state-of-the-art results and reducing reliance on language shortcuts.
Contribution
The paper presents a novel language-aware gating mechanism and a selective frame sampling strategy that improve video question answering by following a thinking-observing-answering process resembling human reasoning.
Findings
Achieves state-of-the-art performance on multiple VideoQA benchmarks.
Effectively reduces reliance on language shortcuts in VideoQA.
Enhances focus on question-relevant visual information.
Abstract
Significant advancements in video question answering (VideoQA) have been made thanks to thriving large image-language pretraining frameworks. Although these image-language models can efficiently represent both video and language branches, they typically employ a goal-free vision perception process and do not let vision and language interact well during answer generation, thus omitting crucial visual cues. In this paper, we are inspired by the human recognition and learning pattern and propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes. VideoDistill generates answers only from question-related visual embeddings and follows a thinking-observing-answering approach that closely resembles human behavior, distinguishing it from previous research. Specifically, we develop a language-aware gating mechanism…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
