Causal Reasoning through Two Layers of Cognition for Improving Generalization in Visual Question Answering
Trang Nguyen, Naoaki Okazaki

TL;DR
This paper introduces CopVQA, a causal reasoning framework for VQA that improves generalization by modeling the interpreting and answering stages with distinct experts, reaching state-of-the-art results on PathVQA and using one-fourth the model size of the current SOTA.
Contribution
It proposes a novel two-layer cognitive pathway approach that emphasizes causal reasoning in multimodal VQA, improving generalization and performance across diverse datasets.
Findings
Achieves state-of-the-art results on the PathVQA dataset.
Improves generalization on VQA-CPv2, VQAv2, and VQA RAD.
Uses one-fourth the model size of the current SOTA.
Abstract
Generalization in Visual Question Answering (VQA) requires models to answer questions about images with contexts beyond the training distribution. Existing attempts primarily refine unimodal aspects, overlooking enhancements in multimodal aspects. In addition, diverse interpretations of the input lead to various modes of answer generation, highlighting the role of causal reasoning between the interpreting and answering steps in VQA. Through this lens, we propose Cognitive pathways VQA (CopVQA), which improves multimodal predictions by emphasizing causal reasoning factors. CopVQA first operates a pool of pathways that capture diverse causal reasoning flows through the interpreting and answering stages. Mirroring human cognition, we decompose the responsibility of each stage into distinct experts and a cognition-enabled component (CC). The two CCs strategically execute one expert for each stage at a…
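The two-stage routing described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated here, so the selection rule (a linear scorer followed by a hard argmax), the random weights, and every name below (`Stage`, `forward`, `linear`) are assumptions made for illustration only. The sketch shows just the forward pass, in which each stage has a pool of experts, a selector standing in for the CC picks exactly one of them, and the pair of chosen experts forms one pathway. Training is not covered.

```python
import random


def linear(weights, x):
    """Apply a dense layer stored as a list of weight rows (no bias)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


class Stage:
    """One stage: a pool of experts plus a selector (hypothetical stand-in for a CC)."""

    def __init__(self, n_experts, dim_in, dim_out, rng):
        # Each expert is a random dim_out x dim_in linear map (assumption).
        self.experts = [
            [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_out)]
            for _ in range(n_experts)
        ]
        # The selector scores every expert from the stage input (assumption).
        self.selector = [
            [rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(n_experts)
        ]

    def __call__(self, x):
        chosen = argmax(linear(self.selector, x))  # execute exactly one expert
        return linear(self.experts[chosen], x), chosen


def forward(fused, interpret, answer):
    """Run the interpreting stage, then the answering stage, on fused features."""
    hidden, i = interpret(fused)
    logits, j = answer(hidden)
    return logits, (i, j)  # (i, j) identifies the pathway that was taken


rng = random.Random(0)
interpret = Stage(n_experts=3, dim_in=8, dim_out=8, rng=rng)
answer = Stage(n_experts=3, dim_in=8, dim_out=5, rng=rng)
fused = [rng.gauss(0, 1) for _ in range(8)]  # placeholder image+question features
logits, pathway = forward(fused, interpret, answer)
print("pathway:", pathway, "predicted answer index:", argmax(logits))
```

With three experts per stage, this toy setup has 3 × 3 = 9 possible pathways, and only one is executed per input. A hard argmax is not differentiable, so a trainable version would need some relaxation or estimator for the selection step; which one the paper uses is not stated in the text above.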
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
