LOIS: Looking Out of Instance Semantics for Visual Question Answering
Siyu Zhang, Yeming Chen, Yaoru Sun, Fang Wang, Haibo Shi, Haoran Wang

TL;DR
LOIS introduces a novel instance semantics approach for VQA that enhances visual reasoning by using relation attention modules without relying on bounding boxes, leading to improved performance on benchmark datasets.
Contribution
The paper proposes LOIS, a bounding box-free framework with relation attention modules to better understand object semantics and improve VQA accuracy.
Findings
Outperforms existing methods on four VQA benchmarks.
Enhances visual reasoning by modeling semantic relations.
Effectively focuses on salient image regions for question answering.
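The idea of describing objects without bounding boxes and then focusing on salient regions can be illustrated with a small sketch. This is not the authors' implementation: it assumes that per-instance binary masks and a convolutional feature map are already available, pools one feature vector per mask, and weights the instances by a plain scaled dot-product with a question vector. The names `masked_average_pool` and `question_guided_attention`, the tensor shapes, and the random inputs are all hypothetical.

```python
import numpy as np


def masked_average_pool(feature_map, masks):
    """Pool one feature vector per instance mask (no bounding boxes).

    feature_map: (H, W, C) array of visual features.
    masks:       (N, H, W) binary instance masks.
    Returns an (N, C) array of per-instance features.
    """
    masks = masks.astype(feature_map.dtype)
    area = masks.sum(axis=(1, 2))                       # pixels per instance
    pooled = np.einsum("nhw,hwc->nc", masks, feature_map)
    return pooled / np.maximum(area, 1.0)[:, None]      # guard empty masks


def softmax(x):
    x = x - x.max()                                     # numerical stability
    e = np.exp(x)
    return e / e.sum()


def question_guided_attention(instance_feats, question_vec):
    """Weight instances by scaled dot-product similarity to the question."""
    d = instance_feats.shape[1]
    scores = instance_feats @ question_vec / np.sqrt(d)
    weights = softmax(scores)                           # one weight per instance
    attended = weights @ instance_feats                 # weighted sum, shape (C,)
    return weights, attended


# Toy inputs (assumed shapes, random values).
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 8, 16))
masks = np.zeros((3, 8, 8))
masks[0, :4, :4] = 1
masks[1, 4:, 4:] = 1
masks[2, 2:6, 2:6] = 1
question_vec = rng.normal(size=16)

instance_feats = masked_average_pool(feature_map, masks)
weights, attended = question_guided_attention(instance_feats, question_vec)
print(weights.shape, attended.shape)
```

Pooling over masks rather than boxes keeps background pixels out of each instance feature, which is the intuition behind a bounding box-free description; the relation attention modules in the paper itself are not reproduced here.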
Abstract
Visual question answering (VQA) has been intensively studied as a multimodal task that requires effort in bridging vision and language to infer answers correctly. Recent attempts have developed various attention-based modules for solving VQA tasks. However, the performance of model inference is largely bottlenecked by visual processing for semantics understanding. Most existing detection methods rely on bounding boxes, so it remains a serious challenge for VQA models to understand the causal nexus of object semantics in images and correctly infer contextual information. To this end, we propose a finer model framework without bounding boxes in this work, termed Looking Out of Instance Semantics (LOIS), to tackle this important issue. LOIS enables more fine-grained feature descriptions to produce visual facts. Furthermore, to overcome the label ambiguity caused by instance masks, two types of…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
