MPCAR: Multi-Perspective Contextual Augmentation for Enhanced Visual Reasoning in Large Vision-Language Models
Amirul Rahman, Qiang Xu, Xueying Huang

TL;DR
MPCAR improves complex visual reasoning in large vision-language models by generating multiple perspectives on an image and integrating them into the prompt, significantly improving accuracy on VQA tasks without any fine-tuning.
Contribution
This paper introduces MPCAR, a novel inference-time strategy that uses multi-perspective contextual augmentation to boost LVLM reasoning without model fine-tuning.
Findings
Significant accuracy improvements on GQA, VQA-CP v2, and ScienceQA datasets.
Enhanced answer coherence and completeness confirmed by human evaluations.
Ablation studies highlight the importance of diverse prompts and perspective count.
Abstract
Despite significant advancements, Large Vision-Language Models (LVLMs) continue to face challenges in complex visual reasoning tasks that demand deep contextual understanding, multi-angle analysis, or meticulous detail recognition. Existing approaches often rely on single-shot image encoding and prompts, limiting their ability to fully capture nuanced visual information. Inspired by the notion that strategically generated "additional" information can serve as beneficial contextual augmentation, we propose Multi-Perspective Contextual Augmentation for Reasoning (MPCAR), a novel inference-time strategy designed to enhance LVLM performance. MPCAR operates in three stages: first, an LVLM generates N diverse and complementary descriptions or preliminary reasoning paths from various angles; second, these descriptions are intelligently integrated with the original question to construct a…
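The first two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The `describe` callable, the perspective prompts, and the layout of the combined prompt are all assumptions made for illustration, and the third stage is left out because the abstract is truncated before it is described.

```python
from typing import Any, Callable, List

# Hypothetical perspective prompts; the paper's actual prompts are not given here.
PERSPECTIVE_PROMPTS: List[str] = [
    "Describe the objects in the image and their attributes.",
    "Describe the spatial relationships between the objects.",
    "Describe the overall scene and any actions taking place.",
]


def generate_perspectives(
    describe: Callable[[Any, str], str],
    image: Any,
    prompts: List[str],
) -> List[str]:
    """Stage 1 (sketch): query the model once per prompt, giving N = len(prompts) descriptions.

    `describe` is a stand-in for any LVLM call taking (image, prompt) and returning text.
    """
    return [describe(image, prompt) for prompt in prompts]


def build_augmented_prompt(question: str, perspectives: List[str]) -> str:
    """Stage 2 (sketch): combine the descriptions with the original question.

    The layout below is an assumed format, not the one used in the paper.
    """
    lines = ["Context from several perspectives:"]
    for i, text in enumerate(perspectives, start=1):
        lines.append(f"[Perspective {i}] {text.strip()}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def fake_lvlm(image: Any, prompt: str) -> str:
    """Stub model so the sketch runs without any real LVLM."""
    return f"stub description for: {prompt}"


if __name__ == "__main__":
    perspectives = generate_perspectives(fake_lvlm, None, PERSPECTIVE_PROMPTS)
    print(build_augmented_prompt("What color is the car?", perspectives))
```

Passing the model as a plain callable keeps the sketch independent of any particular LVLM library; a real model wrapper with the same `(image, prompt)` signature could replace `fake_lvlm`.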
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
