End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

Jianxin Liang; Xiaojun Meng; Yueqian Wang; Chang Liu; Qun Liu; Dongyan Zhao

arXiv:2407.15047·cs.CV·July 24, 2024

TL;DR

This paper introduces VidF4, a novel VideoQA framework with frame scoring and adaptive sampling, significantly improving performance by selecting relevant frames for better visual-textual interaction.

Contribution

We propose a new frame scoring and adaptive sampling mechanism for VideoQA, enabling end-to-end training and outperforming existing methods on multiple benchmarks.

Findings

1. Outperforms existing VideoQA methods on three benchmarks.

2. Sets new state-of-the-art results on those benchmarks, with improvements of +0.3%, +0.9%, and +1.0%.

3. Validates the effectiveness of the approach through quantitative and qualitative analyses.

Abstract

Video Question Answering (VideoQA) has emerged as a challenging frontier in the field of multimedia processing, requiring intricate interactions between visual and textual modalities. Uniformly sampling frames or indiscriminately aggregating frame-level visual features often fails to capture the nuanced, question-relevant context a video contains, which limits VideoQA performance. To mitigate these issues, we propose VidF4, a novel VideoQA framework equipped with a tailored frame selection strategy for effective and efficient VideoQA. We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question about the video. Furthermore, we design a differentiable adaptive frame sampling mechanism to facilitate end-to-end training of the frame selector and answer generator. The experimental results across…
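
To make the two ideas in the abstract concrete, here is a minimal sketch of question-aware frame scoring followed by a soft selection step. It is not the VidF4 implementation: the abstract does not give the three scoring mechanisms or the sampling rule, so everything below is an assumption for illustration. The function names (score_frames, soft_selection_weights, select_top_k), the redundancy_weight and temperature parameters, and the "relevance minus mean similarity to other frames" score are all hypothetical, and a plain temperature softmax stands in for whatever differentiable sampler the paper uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_frames(frames, question, redundancy_weight=0.5):
    """Hypothetical score: question relevance minus a penalty for
    being similar to the other frames (inter-frame redundancy)."""
    scores = []
    for i, frame in enumerate(frames):
        relevance = cosine(frame, question)
        others = [cosine(frame, g) for j, g in enumerate(frames) if j != i]
        redundancy = sum(others) / len(others) if others else 0.0
        scores.append(relevance - redundancy_weight * redundancy)
    return scores


def soft_selection_weights(scores, temperature=0.1):
    """Temperature softmax over scores. In an autodiff framework this is the
    smooth quantity gradients would flow through (assumed stand-in sampler)."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k(weights, k):
    """Hard selection for inference: indices of the k largest weights,
    returned in temporal order."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])


# Toy embeddings: frames 0 and 1 are near-duplicates, frame 2 matches the question.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
question = [0.0, 1.0]

scores = score_frames(frames, question)
weights = soft_selection_weights(scores)
selected = select_top_k(weights, k=2)
```

In this toy case the frame aligned with the question receives nearly all of the selection weight, and of the two near-duplicate frames only the slightly more relevant one is kept. The hard top-k step is not differentiable; only the softmax weights are, which is why a real end-to-end system needs some relaxation of the selection step during training.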
