Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

Qing Jiang; Xingyu Chen; Zhaoyang Zeng; Junzhi Yu; Lei Zhang

arXiv:2506.04034 · cs.CV · June 5, 2025

TL;DR

Rex-Thinker introduces a grounded, interpretable chain-of-thought reasoning approach for object referring, improving accuracy, explainability, and rejection of incorrect matches through structured reasoning over candidate objects.

Contribution

This work formulates object referring as an explicit chain-of-thought reasoning task and creates a new dataset, HumanRef-CoT, to facilitate interpretable reasoning in referring models.

Findings

01. Outperforms baselines in precision and interpretability

02. Enhances rejection of hallucinated outputs

03. Shows strong out-of-domain generalization

Abstract

Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit CoT reasoning task. Given a referring expression, we first identify all candidate object…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Using RL for Referring Expression Comprehension is under-explored, beyond early efforts ([a-b] in Weaknesses below).
2. The improvements from SFT/RL post-training look encouraging, although not totally convincing (see Weaknesses 2 and 3).

Weaknesses

1. Citations to previous REC + RL works are absent, for example [a-b]. Instead, the paper cites only generic VLM works with RL, in which REC is just a subtask.
2. In Table 3, Rex-Thinker-CoT and Rex-Thinker-GRPO perform worse than Qwen2.5-VL-7B (the base model of Rex-Thinker), which seems to be a sign of catastrophic forgetting due to post-training. The authors should find a way to mitigate this.
3. The main experimental results (Tables 2 and 4) are on one dataset only, HumanRef.
4. (Minor

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The plan–act–summarize CoT exposes intermediate reasoning tied to concrete boxes. This improves debuggability and reduces hallucination risk. It also enables a principled “no target” refusal. The paper applies an SFT-then-RL framework to REC with reasoning for MLLMs. The reward combines F1 for grounded detection with a lightweight format constraint, which directly optimizes what the benchmark measures.
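The reward described in this review (detection F1 plus a lightweight format check) can be illustrated with a minimal sketch. This is not the authors' implementation: the IoU threshold of 0.5, the greedy matching, the `<think>`/`<answer>` tag names, the `format_weight` value, and the choice to give full credit for a correct abstention are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_reward(pred: List[Box], gt: List[Box], iou_thr: float = 0.5) -> float:
    """F1 between predicted and ground-truth boxes via greedy IoU matching."""
    if not pred and not gt:
        return 1.0  # assumption: a correct "no target" abstention earns full credit
    if not pred or not gt:
        return 0.0
    unmatched = list(gt)
    tp = 0
    for p in pred:
        # Match each prediction to the best still-unmatched ground-truth box.
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= iou_thr:
            unmatched.remove(best)
            tp += 1
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gt)
    return 2 * precision * recall / (precision + recall)


# Hypothetical output template; the actual tags used by the paper may differ.
FORMAT_RE = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)


def format_reward(text: str) -> float:
    """1.0 if the response follows the assumed think/answer template, else 0.0."""
    return 1.0 if FORMAT_RE.match(text) else 0.0


def total_reward(text: str, pred: List[Box], gt: List[Box],
                 format_weight: float = 0.1) -> float:
    """Detection F1 plus a small, assumed weight on format compliance."""
    return f1_reward(pred, gt) + format_weight * format_reward(text)
```

Under these assumptions, one correct box plus one spurious box against a single ground-truth box gives precision 0.5 and recall 1.0, so the F1 term is 2/3; an empty prediction on an image with no matching object scores 1.0 on the F1 term.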

Weaknesses

The paper does not report even a small-scale human analysis or evaluation of the GPT-4o–generated chain-of-thought data. Quality control relies mainly on rule-based mechanisms such as answer-conditioned prompts and automatic consistency filtering (keeping only samples whose final prediction matches ground truth). This may introduce bias, and there are no inter-annotator checks, even though the data is model-generated and GPT-4o is not the most advanced model available. As a result, the reliability and transfera

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The authors tackle the task of Grounded Object Referring from a novel perspective (Chain-of-Thought reasoning), providing a new, interpretable approach.
2. The authors have constructed a high-quality dataset that can facilitate the development of the research community.
3. The paper is well-written and clearly organized.

Weaknesses

1. The methodology seems largely built on the recent "R1-like RL" and "think-with-images" paradigms, and therefore lacks novelty.
2. The paper lacks validation for the annotations generated by GPT-4o. Given that commercial models have been shown to have issues (e.g., hallucination), a manual review and evaluation of the annotated data is necessary to ensure its quality.
3. The paper lacks comparisons with the recent "Think-with-image" paradigm, e.g., DeepEyes, Pixel-Reasoner, and GRIT. Considering that Rex-T

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)