Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models
Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S.-H. Gary Chan, Hongyang Zhang

TL;DR
This paper critically examines existing benchmarks for referring expression comprehension (REC), finds high labeling error rates in them, and introduces a new comprehensive benchmark, Ref-L4, to better evaluate modern large multimodal models.
Contribution
It identifies significant label noise in current REC benchmarks and proposes Ref-L4, a large, diverse, and detailed benchmark for more accurate evaluation of REC models.
Findings
High labeling error rates in existing benchmarks: 14% in RefCOCO, 24% in RefCOCO+, and 5% in RefCOCOg.
Significant accuracy improvements for the reevaluated LMMs when problematic instances are excluded.
Ref-L4 provides a more comprehensive evaluation platform.
Abstract
Referring expression comprehension (REC) involves localizing a target instance based on a textual description. Recent advancements in REC have been driven by large multimodal models (LMMs) like CogVLM, which achieved 92.44% accuracy on RefCOCO. However, this study questions whether existing benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg capture LMMs' comprehensive capabilities. We begin with a manual examination of these benchmarks, revealing high labeling error rates: 14% in RefCOCO, 24% in RefCOCO+, and 5% in RefCOCOg, which undermine the authenticity of evaluations. We address this by excluding problematic instances and reevaluating several LMMs capable of handling the REC task, showing significant accuracy improvements and thus highlighting the impact of benchmark noise. In response, we introduce Ref-L4, a comprehensive REC benchmark, specifically designed to evaluate modern REC…
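The reevaluation step described in the abstract (score a model, then score it again with problematic instances excluded) can be illustrated with a minimal sketch. It assumes the common REC convention that a prediction counts as correct when its box overlaps the ground-truth box with IoU of at least 0.5; the abstract does not state the threshold, so treat it as an assumption. The function names, the `exclude` parameter, and the toy boxes are hypothetical and are not taken from the paper or its code.

```python
from typing import Collection, Sequence, Tuple

# Boxes are (x1, y1, x2, y2) in pixels; this format is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(
    preds: Sequence[Box],
    gts: Sequence[Box],
    threshold: float = 0.5,          # assumed IoU threshold
    exclude: Collection[int] = (),   # indices flagged as mislabeled
) -> float:
    """Fraction of kept instances whose predicted box reaches the IoU threshold."""
    kept = [i for i in range(len(gts)) if i not in exclude]
    if not kept:
        return 0.0
    hits = sum(iou(preds[i], gts[i]) >= threshold for i in kept)
    return hits / len(kept)


# Toy data (made up): instance 2 stands in for a mislabeled ground truth.
gts = [(0, 0, 10, 10), (0, 0, 10, 10), (5, 5, 15, 15), (0, 0, 4, 4)]
preds = [(0, 0, 10, 10), (1, 1, 10, 10), (0, 0, 10, 10), (0, 0, 4, 4)]

acc_all = rec_accuracy(preds, gts)                 # all four instances
acc_clean = rec_accuracy(preds, gts, exclude={2})  # flagged instance dropped
print(acc_all, acc_clean)
```

On this toy data the accuracy rises once the flagged instance is dropped, which mirrors the direction of the effect the abstract reports; the magnitudes here are illustrative only.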
Taxonomy
Topics: Speech and dialogue systems · Language, Metaphor, and Cognition · Interpreting and Communication in Healthcare
