Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models

Jeonghwan Kim; Renjie Tao; Sanat Sharma; Jiaqi Wang; Kai Sun; Zhaojiang Lin; Seungwhan Moon; Lambert Mathias; Anuj Kumar; Heng Ji; Xin Luna Dong

arXiv:2601.19060 · cs.CV · January 28, 2026

Open Access

TL;DR

PixSearch is an end-to-end multimodal model that unifies perception and retrieval-augmented reasoning for visual question answering, improving factual accuracy and generalization without modular pipelines.

Contribution

The paper introduces PixSearch, which the authors describe as the first model to integrate region-level perception with retrieval-augmented reasoning in a unified framework, eliminating reliance on separate modules.

Findings

01. 19.7% relative accuracy gain on CRAG-MM

02. Improves factual consistency and generalization

03. Retains competitive reasoning performance

Abstract

Visual Question Answering (VQA) often requires coupling fine-grained perception with factual knowledge beyond the input image. Prior multimodal Retrieval-Augmented Generation (MM-RAG) systems improve factual grounding but lack an internal policy for when and how to retrieve. We propose PixSearch, the first end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits <search> tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries, eliminating the reliance on modular pipelines (detectors, segmenters, captioners, etc.). A two-stage supervised fine-tuning regimen with search-interleaved supervision teaches retrieval timing and query selection while preserving segmentation ability. On egocentric and…
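The retrieval behavior described in the abstract (emit a `<search>` token, pick a query modality, and use a pixel mask as the visual query) can be pictured with a toy control loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: `step`, `retrieve`, `Query`, `apply_mask`, and `answer_with_search` are hypothetical names introduced for illustration, images are plain nested lists, and the model is stubbed out as a callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

SEARCH_TOKEN = "<search>"  # token name taken from the abstract


@dataclass
class Query:
    """Hypothetical container for a retrieval request."""
    modality: str                            # "text", "image", or "region"
    text: Optional[str] = None               # used when modality == "text"
    mask: Optional[List[List[int]]] = None   # binary pixel mask for "region"


def apply_mask(image, mask):
    """Zero out pixels outside the mask so the region can act as a visual query."""
    return [
        [px if keep else 0 for px, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def answer_with_search(step: Callable, retrieve: Callable, image, question, max_steps=8):
    """Alternate between model steps and retrieval until an answer is produced.

    `step(image, context)` is an assumed stand-in for the model: it returns
    (SEARCH_TOKEN, Query) to request retrieval, or (answer_text, None) to stop.
    `retrieve(modality, payload)` is an assumed stand-in for the search backend.
    """
    context = [question]
    for _ in range(max_steps):
        token, query = step(image, context)
        if token != SEARCH_TOKEN:
            return token  # the model chose to answer without (further) retrieval
        if query.modality == "region":
            payload = apply_mask(image, query.mask)  # mask is the visual query
        elif query.modality == "image":
            payload = image
        else:
            payload = query.text
        # Retrieved evidence is appended so the next step can condition on it.
        context.append(retrieve(query.modality, payload))
    return ""  # step budget exhausted without an answer
```

In this sketch the decision of when to retrieve lives entirely inside `step`, which mirrors the abstract's point that the policy is internal to the model rather than imposed by an external pipeline.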

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling