Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives

Shaoning Xiao; Long Chen; Kaifeng Gao; Zhao Wang; Yi Yang; Zhimeng Zhang; Jun Xiao

arXiv:2204.11544·cs.CV·November 3, 2022·5 cites

Open Access

TL;DR

This paper proposes a novel multi-modal alignment approach for VideoQA that leverages trajectory features and hierarchical graph architectures, addressing semantic gaps and improving performance on the NExT-QA benchmark.

Contribution

The paper revisits visual-language alignment from feature and sample perspectives, introducing trajectory-based features and training augmentations that reduce reliance on language priors.

Findings

01. Outperforms state-of-the-art methods on the NExT-QA benchmark

02. Enhances trajectory-level and frame-level alignment

03. Strengthens cross-modal correspondence with augmentation strategies

Abstract

Reasoning about causal and temporal event relations in videos is a new goal of Video Question Answering (VideoQA). The major obstacle to achieving it is the semantic gap between language and video, since the two are at different levels of abstraction. Existing efforts mainly focus on designing sophisticated architectures while utilizing frame- or object-level visual representations. In this paper, we reconsider the multi-modal alignment problem in VideoQA from feature and sample perspectives to achieve better performance. From the feature perspective, we break the video down into trajectories and are the first to leverage trajectory features in VideoQA to enhance the alignment between the two modalities. Moreover, we adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features. In addition,…
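As a rough illustration of what a trajectory-level feature means, the sketch below pools one tracked object's per-frame features over time and scores each trajectory against a question embedding. Everything in it is an assumption made for illustration: the function names, the toy 2-D vectors, and the use of mean pooling, cosine similarity, and softmax are stand-ins, not the heterogeneous graph architecture or hierarchical framework the abstract describes.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def align(trajectories, question_vec):
    """Return one attention weight per trajectory for a question embedding.

    Each trajectory is a list of per-frame feature vectors for one tracked
    object. Mean pooling is an illustrative choice, not the paper's method.
    """
    traj_feats = [mean_pool(t) for t in trajectories]
    return softmax([cosine(f, question_vec) for f in traj_feats])

# Toy data (hypothetical): two objects tracked across two frames each.
trajectories = [
    [[1.0, 0.0], [0.8, 0.2]],  # object A
    [[0.0, 1.0], [0.2, 0.8]],  # object B
]
question_vec = [1.0, 0.0]  # hypothetical question embedding
weights = align(trajectories, question_vec)
```

In this toy setup the question embedding points in the same direction as object A's features, so object A's trajectory receives the larger weight. The unit of alignment here is a whole trajectory rather than a single frame or a single object detection.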

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: ALIGN