Causal Reasoning through Two Layers of Cognition for Improving Generalization in Visual Question Answering

Trang Nguyen; Naoaki Okazaki

arXiv:2310.05410·cs.AI·October 10, 2023

Open Access

TL;DR

This paper introduces CopVQA, a causal reasoning framework for VQA that enhances generalization by modeling interpretive and answer stages with distinct experts, achieving state-of-the-art results with smaller models.

Contribution

It proposes a novel two-layer cognitive pathway approach that emphasizes causal reasoning in multimodal VQA, improving generalization and performance across diverse datasets.

Findings

01. Achieves state-of-the-art results on the PathVQA dataset.

02. Improves generalization on VQA-CPv2, VQAv2, and VQA-RAD.

03. Uses one-fourth the model size of the current SOTA.

Abstract

Generalization in Visual Question Answering (VQA) requires models to answer questions about images whose contexts lie beyond the training distribution. Existing approaches primarily refine unimodal aspects and overlook improvements to multimodal ones. Moreover, diverse interpretations of the input lead to different modes of answer generation, which highlights the role of causal reasoning between the interpreting and answering steps in VQA. Through this lens, we propose Cognitive pathways VQA (CopVQA), which improves multimodal predictions by emphasizing causal reasoning factors. CopVQA first operates a pool of pathways that capture diverse causal reasoning flows through the interpreting and answering stages. Mirroring human cognition, we decompose the responsibility of each stage into distinct experts and a cognition-enabled component (CC). The two CCs strategically execute one expert for each stage at a…
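
The two-stage structure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract excerpt does not say how a CC chooses its expert, so the dot-product scoring with an argmax below is an assumption made purely for illustration, and every name (`make_linear_expert`, `select_expert`, `run_pathway`, the key vectors) is hypothetical. The sketch only shows the idea that one expert is executed per stage and that the pair of choices forms a pathway.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_linear_expert(weights: Sequence[Sequence[float]]) -> Expert:
    """Hypothetical expert: a fixed linear map over the input vector."""
    def expert(x: Vector) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return expert


def select_expert(x: Vector, keys: Sequence[Sequence[float]]) -> int:
    """Stand-in for a cognition-enabled component (CC).

    Assumption: each expert has a key vector, and the CC picks the expert
    whose key has the largest dot product with the current representation.
    """
    scores = [sum(k * xi for k, xi in zip(key, x)) for key in keys]
    return max(range(len(scores)), key=scores.__getitem__)


def run_pathway(
    x: Vector,
    interp_experts: Sequence[Expert],
    interp_keys: Sequence[Sequence[float]],
    answer_experts: Sequence[Expert],
    answer_keys: Sequence[Sequence[float]],
) -> Tuple[Tuple[int, int], Vector]:
    """Run one expert per stage and return the chosen pathway and output."""
    i = select_expert(x, interp_keys)      # interpreting-stage choice
    hidden = interp_experts[i](x)
    j = select_expert(hidden, answer_keys)  # answering-stage choice
    output = answer_experts[j](hidden)
    return (i, j), output


# Toy setup: two experts per stage, two-dimensional representations.
interp_experts = [
    make_linear_expert([[1.0, 0.0], [0.0, 1.0]]),  # identity
    make_linear_expert([[0.0, 1.0], [1.0, 0.0]]),  # swap coordinates
]
interp_keys = [[0.0, 1.0], [1.0, 0.0]]
answer_experts = [
    make_linear_expert([[2.0, 0.0], [0.0, 2.0]]),  # scale by 2
    make_linear_expert([[1.0, 1.0], [1.0, 1.0]]),  # sum into both slots
]
answer_keys = [[0.0, 1.0], [1.0, 0.0]]

pathway, output = run_pathway(
    [1.0, 0.0], interp_experts, interp_keys, answer_experts, answer_keys
)
```

With two experts per stage, this toy pool contains four possible pathways; the choice made in the interpreting stage changes the representation that the answering-stage selector sees, which is the dependency between stages that the abstract emphasizes.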

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning