PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation
Muntasir Wahed, Kiet A. Nguyen, Adheesh Sunil Juvekar, Xinzhuo Li, Xiaona Zhou, Vedant Shah, Tianjiao Yu, Pinar Yanardag, Ismini Lourentzou

TL;DR
PRIMA is a multi-image vision-language model that performs pixel-grounded reasoning across multiple images. It combines the pixel-level grounding of existing single-image models with the multi-image understanding of models that lack grounding.
Contribution
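The paper makes three contributions: PRIMA, a Large Vision-Language Model (LVLM) that combines pixel-level grounding with multi-image reasoning; SQuARE, a vision module that injects cross-image relational context into query-based visual tokens; and M4SEG, a new multi-image reasoning segmentation benchmark.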
Findings
PRIMA achieves significant improvements in Recall and S-IoU over baselines.
SQuARE effectively captures cross-image relationships.
The M4SEG benchmark facilitates multi-image pixel-grounded reasoning evaluation.
Abstract
Despite significant advancements in Large Vision-Language Models (LVLMs)' capabilities, existing pixel-grounding models operate in single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning alongside PRIMA, an LVLM that integrates pixel-level grounding with robust multi-image reasoning to produce contextually rich, pixel-grounded explanations. Central to PRIMA is SQuARE, a vision module that injects cross-image relational context into compact query-based visual tokens before fusing them with the language backbone. To support training and evaluation, we curate M4SEG, a new multi-image reasoning segmentation benchmark consisting of 744K question-answer…
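The abstract describes SQuARE as operating on compact query-based visual tokens that carry cross-image context. As a rough illustration of that general idea only, the sketch below lets a small shared set of query vectors attend over patch features pooled from several images, so each output token can mix information from all of them. This is not the authors' architecture: the function names (`cross_attend`, `pool_across_images`), the identity projections, and the random features are all assumptions made for illustration, and a real module would use learned query, key, and value projections.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Scaled dot-product cross-attention with identity projections.

    queries: list of Q vectors, each of dimension d (hypothetical learned queries)
    tokens:  list of N vectors, each of dimension d (patch features)
    Returns (updated queries, attention weights per query).
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(d) for t in tokens]
        w = softmax(scores)
        # Weighted sum of token features, one value per feature dimension.
        out = [sum(w[j] * tokens[j][k] for j in range(len(tokens))) for k in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def pool_across_images(queries, image_features):
    """Flatten patches from all images into one sequence and attend over it.

    Because the queries see every image's patches at once, each output
    token can draw on more than one image.
    """
    all_tokens = [patch for image in image_features for patch in image]
    return cross_attend(queries, all_tokens)


random.seed(0)
dim, num_queries = 8, 4
# Three images with six patch features each (random stand-ins for encoder output).
images = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)] for _ in range(3)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

visual_tokens, attn = pool_across_images(queries, images)
print(len(visual_tokens), len(visual_tokens[0]))  # 4 8
```

The point of the sketch is the token budget: 18 patch features across three images are compressed into 4 tokens, regardless of how many images are supplied, which is the usual motivation for query-based pooling before a language backbone.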
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
