Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering

Yang Liu; Guanbin Li; Liang Lin

arXiv:2207.12647·cs.CV·June 8, 2023·6 cites

TL;DR

This paper introduces a novel framework called CMCIR that enhances event-level visual question answering by modeling causal relationships across visual and linguistic modalities, addressing spurious correlations and capturing event dynamics.

Contribution

The work proposes a cross-modal causal relational reasoning framework with causal intervention operations, improving the understanding of event temporality and causality in VQA tasks.

Findings

01. Outperforms existing methods on four event-level datasets.

02. Effectively discovers visual-linguistic causal structures.

03. Enhances robustness in event-level visual question answering.

Abstract

Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes that fail to capture event temporality, causality, and dynamics spanning over the video. In this work, to address the task of event-level visual question answering, we propose a framework for cross-modal causal relational reasoning. In particular, a set of causal intervention operations is introduced to discover the underlying causal structures across visual and linguistic modalities. Our framework, named Cross-Modal Causal RelatIonal Reasoning (CMCIR), involves three modules: i) Causality-aware Visual-Linguistic Reasoning (CVLR) module for collaboratively disentangling the visual and linguistic spurious correlations via front-door and back-door causal interventions; ii) Spatial-Temporal Transformer (STT) module for capturing the fine-grained…
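The back-door intervention named in the abstract can be illustrated with a toy calculation. The sketch below is not CMCIR's implementation, which operates on learned visual and linguistic features; it is the textbook back-door adjustment, P(y | do(x)) = Σ_z P(y | x, z) P(z), on a hypothetical binary example in which the confounder Z is assumed to be observed. All probabilities and names are made up for illustration.

```python
# Toy back-door adjustment. All numbers are hypothetical, chosen only to
# show how a confounder Z inflates the observed X -> Y association.
P_Z = {0: 0.5, 1: 0.5}             # P(z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}    # P(x=1 | z)
# P(y=1 | x, z), keyed by (x, z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.5, (1, 1): 0.7}


def p_x_given_z(x, z):
    """P(x | z) for binary x."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def observational(x):
    """P(y=1 | x) = sum_z P(y=1 | x, z) P(z | x), with P(z | x) from Bayes' rule."""
    p_x = sum(p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    return sum(
        P_Y1_GIVEN_XZ[(x, z)] * p_x_given_z(x, z) * P_Z[z] / p_x for z in P_Z
    )


def backdoor(x):
    """P(y=1 | do(x)) = sum_z P(y=1 | x, z) P(z): weight by P(z), not P(z | x)."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


obs_gap = observational(1) - observational(0)
causal_gap = backdoor(1) - backdoor(0)
print(f"observational gap:  {obs_gap:.2f}")    # 0.44
print(f"interventional gap: {causal_gap:.2f}")  # 0.20
```

In this toy setup the observational gap (0.44) overstates the interventional one (0.20) because Z drives both X and Y. This is the kind of spurious correlation that causal intervention is meant to remove. The front-door adjustment, which the abstract also mentions, is typically used when the confounder cannot be observed and is not shown here.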

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Dropout · Adam · Byte Pair Encoding · Label Smoothing · Layer Normalization