Chain-of-Caption: Training-free improvement of multimodal large language model on referring expression comprehension

Yik Lung Pang; Changjae Oh

arXiv:2602.08211·cs.CV·February 10, 2026

TL;DR

Chain-of-Caption is a training-free method that improves how well multimodal large language models localize objects from referring expressions. It supplies the model with combined visual and textual context, which the paper reports yields 5-30% accuracy gains.

Contribution

The paper proposes Chain-of-Caption, a training-free framework that improves referring expression comprehension by supplying the model with multiple visual and textual contexts.

Findings

01. Individual contexts improve REC performance without fine-tuning

02. Combining multiple contexts yields 5-30% accuracy gains

03. Effective across multiple datasets like RefCOCO and Ref-L4

Abstract

Given a textual description, the task of referring expression comprehension (REC) involves the localisation of the referred object in an image. Multimodal large language models (MLLMs) have achieved high accuracy on REC benchmarks through scaling up the model size and training data. Moreover, the performance of MLLMs can be further improved using techniques such as Chain-of-Thought and tool use, which provide additional visual or textual context to the model. In this paper, we analyse how various techniques for providing additional visual and textual context to the MLLM via tool use affect the REC task. Furthermore, we propose a training-free framework named Chain-of-Caption to improve the REC performance of MLLMs. We perform experiments on RefCOCO/RefCOCOg/RefCOCO+ and Ref-L4 datasets and show that individual textual or visual context can improve the REC…
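
To make the accuracy figures concrete, here is a minimal sketch of how REC accuracy is commonly scored. This excerpt does not state the metric, so the sketch assumes the convention widely used on RefCOCO-style benchmarks: a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The function names `iou` and `rec_accuracy`, the `(x1, y1, x2, y2)` box layout, and the default threshold are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Degenerate boxes with zero total area get an IoU of 0.
    return inter / union if union > 0 else 0.0


def rec_accuracy(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the ground truth meets the threshold.

    The 0.5 default is an assumed convention, not a value from the paper.
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

Under this assumed metric, a gain in accuracy means that more predicted boxes clear the IoU threshold; it says nothing about how much tighter the boxes that were already correct have become.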

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems