End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling
Jianxin Liang, Xiaojun Meng, Yueqian Wang, Chang Liu, Qun Liu, Dongyan Zhao

TL;DR
This paper introduces VidF4, a novel VideoQA framework with frame scoring and adaptive sampling, significantly improving performance by selecting relevant frames for better visual-textual interaction.
Contribution
We propose a new frame scoring and adaptive sampling mechanism for VideoQA, enabling end-to-end training and outperforming existing methods on multiple benchmarks.
Findings
Outperforms existing VideoQA methods on three benchmarks.
Achieves new state-of-the-art results, with improvements of +0.3%, +0.9%, and +1.0% on the three benchmarks.
Validates effectiveness through quantitative and qualitative analyses.
Abstract
Video Question Answering (VideoQA) has emerged as a challenging frontier in multimedia processing, requiring intricate interactions between visual and textual modalities. Uniformly sampling frames or indiscriminately aggregating frame-level visual features often fails to capture the nuanced, question-relevant context of a video that VideoQA requires. To mitigate these issues, we propose VidF4, a novel VideoQA framework equipped with a tailored frame selection strategy for effective and efficient VideoQA. We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question about the video. Furthermore, we design a differentiable adaptive frame sampling mechanism that enables end-to-end training of the frame selector and the answer generator. The experimental results across…
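To make the idea of question-aware frame scoring followed by a soft, differentiable selection more concrete, here is a minimal NumPy sketch. It is not the VidF4 implementation: the abstract does not spell out the three scoring mechanisms or the sampling rule, so everything below is an assumption chosen for illustration. The function names (`score_frames`, `soft_select`), the `redundancy_weight` and `temperature` parameters, the use of cosine similarity, and the successive-softmax relaxation of top-k are all hypothetical stand-ins.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale vectors along the last axis to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def score_frames(frame_feats, question_feat, redundancy_weight=0.5):
    """Illustrative score: question relevance minus inter-frame redundancy.

    frame_feats: (T, D) array of frame features.
    question_feat: (D,) array for the question.
    Both the formula and redundancy_weight are assumptions, not from the paper.
    """
    f = l2_normalize(frame_feats)
    q = l2_normalize(question_feat)
    relevance = f @ q                      # cosine similarity to the question, (T,)
    sim = f @ f.T                          # pairwise frame cosine similarity, (T, T)
    np.fill_diagonal(sim, 0.0)             # ignore each frame's similarity to itself
    redundancy = sim.sum(axis=1) / max(len(f) - 1, 1)
    return relevance - redundancy_weight * redundancy

def soft_select(scores, k, temperature=0.1):
    """Relaxed top-k via k successive softmaxes (a generic relaxation,
    used here only as a stand-in for a differentiable sampler).

    Returns a (k, T) matrix whose rows are probability distributions over frames.
    """
    logits = scores / temperature
    rows = []
    for _ in range(k):
        z = logits - logits.max()          # subtract max for numerical stability
        p = np.exp(z) / np.exp(z).sum()
        rows.append(p)
        # Down-weight frames that already received probability mass.
        logits = logits + np.log(np.clip(1.0 - p, 1e-8, 1.0))
    return np.stack(rows)

# Toy usage with random features (hypothetical data).
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
question = frames[3] + 0.1 * rng.normal(size=16)
scores = score_frames(frames, question)
weights = soft_select(scores, k=2)
selected = weights @ frames                # (2, 16) soft mixture of frame features
```

Because each row of `weights` is a smooth function of the scores, the same computation written in an autodiff framework would let gradients from an answer loss reach the scorer, which is the general property that end-to-end training of a selector and an answer generator relies on.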
