Attention Mechanism based Cognition-level Scene Understanding

Xuejiao Tang; Wenbin Zhang

arXiv:2204.08027 · cs.CV · March 10, 2025

TL;DR

This paper introduces PAVCR, a parallel attention-based model for visual commonsense reasoning that improves information fusion and inference, demonstrating significant performance gains and interpretability on the VCR benchmark.

Contribution

The paper presents a novel parallel attention-based architecture that enhances semantic encoding and reasoning in VCR tasks, addressing limitations of previous long-sequence models.

Findings

01 Significant performance improvements on the VCR dataset

02 Enhanced interpretability of the reasoning process

03 Effective fusion of visual and textual information

Abstract

Given a question-image input, a Visual Commonsense Reasoning (VCR) model predicts an answer together with its rationale, which requires real-world inference ability. VCR is a cognition-level scene understanding task: it calls for exploiting multi-source information, learning different levels of understanding, and drawing on extensive commonsense knowledge. The task has attracted researchers' interest because of its wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches generally rely on pre-training or on memory models that encode long-range dependencies. However, these approaches generalize poorly and lose information over long sequences. In this paper, we propose a parallel attention-based cognitive VCR…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning