RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding

Jiahe Wu; Bing Cao; Qilong Wang; Qinghua Hu; Dongdong Li; Pengfei Zhu

arXiv:2602.00504·cs.CV·February 3, 2026

Open Access

TL;DR

RGBX-R1 extends multimodal large language models' perception and reasoning beyond RGB to other visual modalities (such as infrared, depth, and event data) using a Visual Modality Chain-of-Thought and a two-stage training process, with a reported 22.71% improvement over baselines on RGBX grounding tasks.

Contribution

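This paper introduces RGBX-R1, a framework that combines a Visual Modality Chain-of-Thought (VM-CoT) with a two-stage training paradigm, Cold-Start Supervised Fine-Tuning (CS-SFT) followed by Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT), to extend MLLMs' capabilities from RGB to diverse visual modalities.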

Findings

01. Outperforms baselines by 22.71% on RGBX grounding tasks

02. Constructs the first RGBX-Grounding benchmark

03. Demonstrates improved multimodal understanding and spatial perception

Abstract

Multimodal Large Language Models (MLLMs) are primarily pre-trained on the RGB modality, which limits their performance on other modalities, such as infrared, depth, and event data, that are crucial for complex scenarios. To address this, we propose RGBX-R1, a framework to enhance MLLMs' perception and reasoning capacities across various X visual modalities. Specifically, we employ an Understand-Associate-Validate (UAV) prompting strategy to construct the Visual Modality Chain-of-Thought (VM-CoT), which aims to extend the MLLMs' RGB understanding capability to X modalities. To progressively enhance reasoning capabilities, we introduce a two-stage training paradigm: Cold-Start Supervised Fine-Tuning (CS-SFT) and Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT). CS-SFT supervises the reasoning process with the guidance of VM-CoT, equipping the MLLM with fundamental modality…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications