Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering

Zhaohe Liao; Jiangtong Li; Li Niu; Liqing Zhang

arXiv:2407.03008·cs.CV·July 4, 2024

TL;DR

This paper introduces VA³, a model-agnostic framework that improves compositional reasoning and accuracy in VideoQA by integrating video alignment and answer aggregation modules, with enhanced interpretability.

Contribution

The paper proposes a novel VA³ framework that enhances existing VideoQA methods' compositional consistency and accuracy through hierarchical video alignment and answer aggregation modules.

Findings

1. Improves compositional consistency of VideoQA methods.
2. Enhances accuracy of existing VideoQA models.
3. Provides more interpretable VideoQA systems.

Abstract

Despite recent progress in Video Question-Answering (VideoQA), existing methods typically function as black boxes, making it difficult to understand their reasoning processes and to perform consistent compositional reasoning. To address these challenges, we propose a model-agnostic Video Alignment and Answer Aggregation (VA³) framework, which is capable of enhancing both the compositional consistency and the accuracy of existing VideoQA methods by integrating video aligner and answer aggregator modules. The video aligner hierarchically selects the relevant video clips based on the question, while the answer aggregator deduces the answer to the question based on its sub-questions, with compositional consistency ensured by the information flow along the question decomposition graph and a contrastive learning strategy. We evaluate our framework on three settings of the AGQA-Decomp…
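
To make the idea of hierarchical, question-conditioned clip selection concrete, here is a minimal sketch in plain Python. It is not the paper's video aligner: the abstract does not specify the scoring function or the hierarchy, so this sketch assumes cosine similarity between pre-computed clip and question feature vectors, a two-level hierarchy (fixed-size segments, then clips), and top-k selection at each level. The names `hierarchical_select`, `segment_size`, `k_segments`, and `k_clips` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def top_k(scores, k):
    """Indices of the k highest scores, returned in ascending (temporal) order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(ranked)


def hierarchical_select(clip_feats, question_feat, segment_size, k_segments, k_clips):
    """Assumed two-level selection: keep the best segments, then the best clips in them."""
    # Coarse level: group consecutive clips into fixed-size segments.
    segments = [
        list(range(start, min(start + segment_size, len(clip_feats))))
        for start in range(0, len(clip_feats), segment_size)
    ]
    # Score each segment by the similarity of its mean feature to the question.
    seg_scores = [
        cosine(mean_vector([clip_feats[i] for i in seg]), question_feat)
        for seg in segments
    ]
    kept_segments = top_k(seg_scores, k_segments)

    # Fine level: score individual clips inside the kept segments only.
    candidates = [i for s in kept_segments for i in segments[s]]
    clip_scores = [cosine(clip_feats[i], question_feat) for i in candidates]
    return [candidates[j] for j in top_k(clip_scores, k_clips)]


# Toy example: six clips with 2-d features and a question aligned with the second axis.
clips = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0.9, 0.1]]
selected = hierarchical_select(clips, [0, 1], segment_size=2, k_segments=2, k_clips=2)
print(selected)
```

In this toy setup the selected clip indices could then be passed to any downstream VideoQA model, which is one way to read the "model-agnostic" claim: the selection step only filters the input and does not depend on the answering model's internals.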

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning