PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

Muntasir Wahed; Kiet A. Nguyen; Adheesh Sunil Juvekar; Xinzhuo Li; Xiaona Zhou; Vedant Shah; Tianjiao Yu; Pinar Yanardag; Ismini Lourentzou

arXiv:2412.15209·cs.CV·December 2, 2025

Open Access

TL;DR

PRIMA is a vision-language model that performs pixel-grounded reasoning across multiple images. It combines pixel-level grounding with multi-image understanding, two capabilities that existing models offer only separately.

Contribution

The paper presents PRIMA, a novel large vision-language model (LVLM) that combines pixel-level grounding with multi-image reasoning. It also introduces SQuARE, a module that supplies cross-image relational context, and M4SEG, a new benchmark.

Findings

1. PRIMA achieves significant improvements in Recall and S-IoU over baselines.
2. SQuARE effectively captures cross-image relationships.
3. The M4SEG benchmark facilitates multi-image pixel-grounded reasoning evaluation.
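
The findings name S-IoU without defining it on this page. As a point of reference only, the sketch below computes plain intersection-over-union between two binary masks, the standard overlap measure in segmentation. It is not the paper's S-IoU. The function name `mask_iou` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def mask_iou(pred, target):
    """Plain IoU between two binary masks given as equally sized nested lists.

    Illustrative only: this is ordinary mask IoU, not the S-IoU metric
    reported for PRIMA, whose definition is not given on this page.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 overlapping pixels / 4 pixels in the union = 0.5
```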

Abstract

Despite significant advances in the capabilities of Large Vision-Language Models (LVLMs), existing pixel-grounding models operate in single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning alongside PRIMA, an LVLM that integrates pixel-level grounding with robust multi-image reasoning to produce contextually rich, pixel-grounded explanations. Central to PRIMA is SQuARE, a vision module that injects cross-image relational context into compact query-based visual tokens before fusing them with the language backbone. To support training and evaluation, we curate M4SEG, a new multi-image reasoning segmentation benchmark consisting of $\sim$ 744K question-answer…
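
The abstract describes SQuARE only at a high level, so the following is a generic sketch of query-based pooling over several images and not the paper's module. It assumes a small set of query vectors that cross-attend over patch features concatenated from all images, so that each resulting token can mix information across images. The names `cross_attend` and `pool_multi_image`, the single attention head, and the absence of learned projections are all hypothetical simplifications.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head scaled dot-product attention: each query attends over all features."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, f)) for f in features]
        weights = softmax(scores)
        # Weighted average of the feature vectors, one output vector per query.
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return outputs


def pool_multi_image(queries, per_image_features):
    """Hypothetical pooling step: flatten patches from every image, then attend.

    Because the queries see patches from all images at once, each output
    token can carry cross-image context. This is an assumption about the
    general idea, not a reproduction of SQuARE.
    """
    all_patches = [patch for image in per_image_features for patch in image]
    return cross_attend(queries, all_patches)


random.seed(0)
dim = 4
# Three images, six patch features each (random stand-ins for encoder outputs).
images = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)] for _ in range(3)]
# Two query vectors (random here; a real model would learn them).
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]

tokens = pool_multi_image(queries, images)
print(len(tokens), len(tokens[0]))  # 2 4
```

The number of output tokens depends only on the number of queries, not on how many images or patches are supplied, which is what makes a query-based representation compact.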

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques