Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning
Qing Jiang, Xingyu Chen, Zhaoyang Zeng, Junzhi Yu, Lei Zhang

TL;DR
Rex-Thinker introduces a grounded, interpretable chain-of-thought reasoning approach for object referring, improving accuracy, explainability, and rejection of incorrect matches through structured reasoning over candidate objects.
Contribution
This work formulates object referring as an explicit chain-of-thought reasoning task and creates a new dataset, HumanRef-CoT, to facilitate interpretable reasoning in referring models.
Findings
Outperforms baselines in precision and interpretability
Enhances rejection of hallucinated outputs
Shows strong out-of-domain generalization
Abstract
Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit CoT reasoning task. Given a referring expression, we first identify all candidate object…
Peer Reviews
Decision: ICLR 2026 Poster
1. Using RL for Referring Expression Comprehension is under-explored, beyond early efforts ([a-b] in Weaknesses below).
2. The improvements from SFT/RL post-training look encouraging, although not totally convincing (see Weaknesses 2 and 3).
1. Citations to previous REC + RL works are absent, for example [a-b]. Instead, the paper cites only generic VLM works with RL, in which REC is just a subtask.
2. In Table 3, Rex-Thinker-CoT and Rex-Thinker-GRPO perform worse than QwenVL-2.5-7B (the base model of Rex-Thinker), which seems to be a sign of catastrophic forgetting due to post-training. The authors should find a way to mitigate this.
3. The main experimental results (Tables 2 and 4) are on one dataset only, HumanRef.
4. (Minor
The plan–act–summarize CoT exposes intermediate reasoning tied to concrete boxes. This improves debuggability and reduces hallucination risk. It also enables a principled “no target” refusal. The paper applies an SFT-then-RL framework to REC with reasoning for MLLMs. The reward combines F1 for grounded detection with a lightweight format constraint, which directly optimizes what the benchmark measures.
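The reward this review describes (a detection F1 term plus a format term) could look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's implementation: the IoU threshold of 0.5, the greedy box matching, the `<think>`/`<answer>` tag layout, the weights, the treatment of a correct abstention as a full-score case, and every function name are hypothetical.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_reward(pred: List[Box], gt: List[Box], thr: float = 0.5) -> float:
    """F1 between predicted and ground-truth boxes (threshold is an assumption)."""
    if not pred and not gt:
        return 1.0  # assumption: abstaining when nothing matches earns full reward
    if not pred or not gt:
        return 0.0
    unmatched = list(gt)
    tp = 0
    for p in pred:
        # Greedy one-to-one matching: take the best remaining ground-truth box.
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)
            tp += 1
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gt)
    return 2 * precision * recall / (precision + recall)


# Hypothetical output layout: reasoning in <think>, final boxes in <answer>.
FORMAT_RE = re.compile(r"^<think>.*</think>\s*<answer>.*</answer>$", re.DOTALL)


def format_reward(text: str) -> float:
    """1.0 if the response follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(text.strip()) else 0.0


def total_reward(text: str, pred: List[Box], gt: List[Box],
                 w_f1: float = 1.0, w_fmt: float = 0.5) -> float:
    """Weighted sum of both terms; the weights are illustrative only."""
    return w_f1 * f1_reward(pred, gt) + w_fmt * format_reward(text)
```

Because the F1 term counts both missed targets and spurious boxes, a reward of this shape penalizes over-prediction as well as under-prediction, and the empty-prediction case gives the model a way to be rewarded for refusing.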
The paper does not report even a small-scale human analysis or evaluation of the GPT-4o–generated chain-of-thought data. Quality control relies mainly on rule-based mechanisms such as answer-conditioned prompts and automatic consistency filtering (keeping only samples whose final prediction matches ground truth). This may introduce bias into the framework, and there are no inter-annotator checks, even though GPT-4o is not the most advanced model and the data is synthetic. As a result, the reliability and transfera
1. The authors tackle the task of Grounded Object Referring from a novel perspective (Chain-of-Thought reasoning), providing a new, interpretable approach.
2. The authors have constructed a high-quality dataset that can facilitate the development of the research community.
3. The paper is well-written and clearly organized.
1. The methodology seems largely built on the recent "R1-like RL" and "think-with-images" paradigms, and so lacks novelty.
2. The paper lacks validation for the annotations generated by GPT-4o. Given that commercial models have been shown to have issues (e.g., hallucination), a manual review and evaluation of the annotated data is necessary to ensure its quality.
3. The paper lacks comparisons with the recent "Think-with-image" paradigm, e.g., Deepeyes, Pixel-Reasoner, and GRIT. Considering that Rex-T
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
