Redundancy-aware Transformer for Video Question Answering
Yicong Li, Xun Yang, An Zhang, Chun Feng, Xiang Wang, Tat-Seng Chua

TL;DR
This paper introduces RaFormer, a transformer architecture for VideoQA that reduces redundancy by focusing on object-level changes and selective vision-language interactions, leading to state-of-the-art results.
Contribution
The paper proposes a redundancy-aware transformer for VideoQA that targets two sources of redundancy: neighboring-frame redundancy in the video encoder and cross-modal redundancy in vision-language fusion.
Findings
Achieves state-of-the-art results on multiple VideoQA benchmarks.
Effectively reduces neighboring-frame redundancy by focusing on object-level changes.
Improves vision-language interaction efficiency through adaptive sampling.
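To make the idea of selective vision-language interaction concrete, here is a minimal sketch that scores each visual element against the question tokens and keeps only the top-scoring ones before any fusion step. It is an illustration under stated assumptions, not RaFormer's sampling module: the function name select_relevant_elements, the cosine-similarity score, and the fixed budget k are all hypothetical choices, and the paper's adaptive sampling may work differently.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_relevant_elements(visual, question, k):
    """Return the indices (in original order) of the k visual elements
    whose best match against any question token is highest.

    Hypothetical helper: a fixed-k stand-in for an adaptive sampler.
    Assumes `question` contains at least one token embedding.
    """
    # Relevance of a visual element = its best similarity to any question token.
    scores = [max(cosine(v, q) for q in question) for v in visual]
    # Rank elements by relevance, keep the top k, restore temporal order.
    ranked = sorted(range(len(visual)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy embeddings (made up for illustration): four visual elements, one question token.
visual = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
question = [[1.0, 0.1]]
kept = select_relevant_elements(visual, question, k=2)
print(kept)  # [0, 2]
```

Only the kept elements would be passed on to fusion, so the question tokens never interact with the visual elements judged irrelevant. A fixed k is the simplest possible budget; an adaptive variant would choose how many elements to keep per question.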
Abstract
This paper identifies two kinds of redundancy in the current VideoQA paradigm. Specifically, current video encoders tend to holistically embed all video clues at different granularities in a hierarchical manner, which inevitably introduces neighboring-frame redundancy that can overwhelm detailed visual clues at the object level. Subsequently, prevailing vision-language fusion designs introduce cross-modal redundancy by exhaustively fusing all visual elements with question tokens without explicitly differentiating their pairwise vision-language interactions, which harms answer prediction. To this end, we propose a novel transformer-based architecture that aims to model VideoQA in a redundancy-aware manner. To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes the object-level change in…
