Attention Mechanism based Cognition-level Scene Understanding
Xuejiao Tang, Wenbin Zhang

TL;DR
This paper introduces PAVCR, a parallel attention-based model for visual commonsense reasoning that improves information fusion and inference, demonstrating significant performance gains and interpretability on the VCR benchmark.
Contribution
The paper presents a novel parallel attention-based architecture that enhances semantic encoding and reasoning in VCR tasks, addressing limitations of previous long-sequence models.
Findings
Significant performance improvements on the VCR dataset
Enhanced interpretability of the reasoning process
Effective fusion of visual and textual information
Abstract
Given a question-image input, a Visual Commonsense Reasoning (VCR) model predicts an answer together with the corresponding rationale, which requires real-world inference ability. The VCR task is a cognition-level scene understanding task: it calls for exploiting multi-source information, learning different levels of understanding, and drawing on extensive commonsense knowledge. The task has attracted researchers' interest because of its wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches generally rely on pre-training or on exploiting memory with models that encode long dependency relationships. However, these approaches suffer from limited generalizability and from information loss in long sequences. In this paper, we propose a parallel attention-based cognitive VCR…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
