Saliency-R1: Incentivizing Unified Saliency Reasoning Capability in MLLM with Confidence-Guided Reinforcement Learning

Long Li; Shuichen Ji; Ziyang Luo; Zhihui Li; Dingwen Zhang; Junwei Han; Nian Liu

arXiv:2511.00396 · cs.CV · November 27, 2025

Open Access

TL;DR

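Saliency-R1 is a unified multimodal large language model (MLLM) framework that jointly handles three saliency tasks: Salient Object Detection (SOD), Salient Instance Segmentation (SIS), and Co-salient Object Detection (CoSOD). It pairs a structured textual interface with a new reinforcement learning algorithm to strengthen visual saliency reasoning.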

Contribution

The paper presents the first unified MLLM framework for multiple saliency tasks, together with Confidence-Guided Policy Optimization (CGPO), a single-sample reinforcement learning method for training the model efficiently.

Findings

01. Outperforms existing models on saliency tasks

02. Effectively encodes region- and instance-level references

03. Reduces training overhead with CGPO

Abstract

Although multimodal large language models (MLLMs) excel in high-level vision-language reasoning, they lack inherent awareness of visual saliency, making it difficult to identify key visual elements. To bridge this gap, we propose Saliency-R1, the first unified MLLM framework that jointly tackles three representative and heterogeneous saliency tasks: Salient Object Detection (SOD), Salient Instance Segmentation (SIS), and Co-salient Object Detection (CoSOD), enhancing the model's capacity for saliency reasoning. We introduce a textual interface with structured tags (<rg>, <ins>) to encode region- and instance-level referring expressions, enabling a single referring segmenter to produce task-appropriate masks. To train the MLLM efficiently, we propose Confidence-Guided Policy Optimization (CGPO), a novel single-sample reinforcement learning algorithm. CGPO improves on GRPO by replacing…
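The structured-tag interface described in the abstract can be illustrated with a small parser. This is a minimal sketch, not the authors' implementation: the abstract names only the tags <rg> and <ins>, so the closing-tag forms (</rg>, </ins>), the helper name parse_saliency_tags, and the example response string are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Assumption: each tag is paired with a closing tag, e.g. <rg>...</rg>.
# The abstract only names the opening tags <rg> and <ins>.
TAG_PATTERN = re.compile(r"<(rg|ins)>(.*?)</\1>", re.DOTALL)


def parse_saliency_tags(response: str) -> Dict[str, List[str]]:
    """Collect region-level (<rg>) and instance-level (<ins>) expressions.

    Hypothetical helper: returns the text found inside each tag type,
    in order of appearance, with surrounding whitespace stripped.
    """
    parsed: Dict[str, List[str]] = {"rg": [], "ins": []}
    for tag, text in TAG_PATTERN.findall(response):
        parsed[tag].append(text.strip())
    return parsed


# Made-up model response, for illustration only.
example = (
    "The salient content is <rg>the two dogs on the grass</rg>. "
    "Instances: <ins>the brown dog</ins> <ins>the white dog</ins>"
)
result = parse_saliency_tags(example)
print(result)
```

Under these assumptions, the extracted expressions could then be passed one by one to a referring segmenter: a region-level expression would yield a single mask, while each instance-level expression would yield its own mask.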

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Neural Network Applications