Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives
Shaoning Xiao, Long Chen, Kaifeng Gao, Zhao Wang, Yi Yang, Zhimeng Zhang, Jun Xiao

TL;DR
This paper proposes a multi-modal alignment approach for VideoQA that combines trajectory features with a heterogeneous graph architecture in a hierarchical framework, narrowing the semantic gap between language and video and improving performance on the NExT-QA benchmark.
Contribution
It revisits visual-language alignment from feature and sample perspectives, introducing trajectory-based features and training augmentations that reduce language priors.
Findings
Outperforms the state of the art on the NExT-QA benchmark
Enhances trajectory-level and frame-level alignment
Strengthens cross-modal correspondence with augmentation strategies
Abstract
Reasoning about causal and temporal event relations in videos is a new destination of Video Question Answering (VideoQA). The major stumbling block to achieving this goal is the semantic gap between language and video, since they are at different levels of abstraction. Existing efforts mainly focus on designing sophisticated architectures while utilizing frame- or object-level visual representations. In this paper, we reconsider the multi-modal alignment problem in VideoQA from feature and sample perspectives to achieve better performance. From the feature perspective, we break down the video into trajectories and first leverage trajectory features in VideoQA to enhance the alignment between the two modalities. Moreover, we adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features. In addition,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: ALIGN
