MPCAR: Multi-Perspective Contextual Augmentation for Enhanced Visual Reasoning in Large Vision-Language Models

Amirul Rahman; Qiang Xu; Xueying Huang

arXiv:2508.12400 · cs.CV · August 19, 2025

TL;DR

MPCAR enhances the complex visual reasoning of large vision-language models by generating multiple perspectives on an image and integrating them into the prompt, significantly improving accuracy on VQA tasks without fine-tuning.

Contribution

This paper introduces MPCAR, a novel inference-time strategy that uses multi-perspective contextual augmentation to boost LVLM reasoning without model fine-tuning.

Findings

01. Significant accuracy improvements on GQA, VQA-CP v2, and ScienceQA datasets.

02. Enhanced answer coherence and completeness confirmed by human evaluations.

03. Ablation studies highlight the importance of diverse prompts and perspective count.

Abstract

Despite significant advancements, Large Vision-Language Models (LVLMs) continue to face challenges in complex visual reasoning tasks that demand deep contextual understanding, multi-angle analysis, or meticulous detail recognition. Existing approaches often rely on single-shot image encoding and prompts, limiting their ability to fully capture nuanced visual information. Inspired by the notion that strategically generated "additional" information can serve as beneficial contextual augmentation, we propose Multi-Perspective Contextual Augmentation for Reasoning (MPCAR), a novel inference-time strategy designed to enhance LVLM performance. MPCAR operates in three stages: first, an LVLM generates N diverse and complementary descriptions or preliminary reasoning paths from various angles; second, these descriptions are intelligently integrated with the original question to construct a…
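The first two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `lvlm` callable, the perspective prompts in `PERSPECTIVE_PROMPTS`, and the layout of the augmented prompt are all assumptions made for illustration, and the sketch stops once the augmented prompt is built because the abstract is truncated before the remaining stage is described.

```python
from typing import Any, Callable, List

# Hypothetical perspective prompts; the source does not list the prompts actually used.
PERSPECTIVE_PROMPTS = [
    "Describe the main objects in the image and their attributes.",
    "Describe the spatial relationships between objects in the image.",
    "Describe the overall scene and any ongoing actions.",
]


def generate_perspectives(
    lvlm: Callable[[Any, str], str], image: Any, n: int
) -> List[str]:
    """Stage 1 (sketch): query the model n times, cycling through the prompts."""
    prompts = [PERSPECTIVE_PROMPTS[i % len(PERSPECTIVE_PROMPTS)] for i in range(n)]
    return [lvlm(image, prompt) for prompt in prompts]


def build_augmented_prompt(question: str, perspectives: List[str]) -> str:
    """Stage 2 (sketch): combine the descriptions with the original question."""
    lines = ["Context from multiple perspectives:"]
    lines += [f"{i}. {text}" for i, text in enumerate(perspectives, start=1)]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

Here `lvlm` stands in for any function that takes an image and a text prompt and returns text, so the sketch can be exercised with a stub before a real model is wired in.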

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques