Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models

Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S.-H. Gary Chan, Hongyang Zhang

arXiv:2406.16866 · cs.CV · June 25, 2024 · 1 citation

TL;DR

This paper critically examines existing benchmarks for referring expression comprehension (REC), reveals high labeling error rates in them, and introduces a new comprehensive benchmark, Ref-L4, to better evaluate modern large multimodal models.

Contribution

It identifies significant label noise in current REC benchmarks and proposes Ref-L4, a large, diverse, and detailed benchmark for more accurate evaluation of REC models.

Findings

01

High labeling error rates in existing benchmarks: 14% in RefCOCO, 24% in RefCOCO+, and 5% in RefCOCOg.

02

Significant accuracy improvements for the re-evaluated LMMs when noisy instances are excluded.

03

Ref-L4 provides a more comprehensive evaluation platform.

Abstract

Referring expression comprehension (REC) involves localizing a target instance based on a textual description. Recent advancements in REC have been driven by large multimodal models (LMMs) like CogVLM, which achieved 92.44% accuracy on RefCOCO. However, this study questions whether existing benchmarks, such as RefCOCO, RefCOCO+, and RefCOCOg, capture LMMs' comprehensive capabilities. We begin with a manual examination of these benchmarks, revealing high labeling error rates: 14% in RefCOCO, 24% in RefCOCO+, and 5% in RefCOCOg, which undermines the authenticity of evaluations. We address this by excluding problematic instances and reevaluating several LMMs capable of handling the REC task, showing significant accuracy improvements, thus highlighting the impact of benchmark noise. In response, we introduce Ref-L4, a comprehensive REC benchmark, specifically designed to evaluate modern REC…
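
The re-evaluation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the metric, so the code assumes the conventional REC measure: a prediction counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The function names `iou` and `rec_accuracy`, the `exclude` set of flagged indices, and the toy boxes are hypothetical illustrations, not taken from the paper or its repository.

```python
from typing import Collection, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2), an assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(
    preds: Sequence[Box],
    gts: Sequence[Box],
    exclude: Collection[int] = frozenset(),
    threshold: float = 0.5,  # assumed threshold, conventional for REC
) -> float:
    """Fraction of kept instances whose predicted box matches the label."""
    kept = [i for i in range(len(gts)) if i not in exclude]
    if not kept:
        raise ValueError("no instances left after exclusion")
    hits = sum(iou(preds[i], gts[i]) >= threshold for i in kept)
    return hits / len(kept)


# Toy data (hypothetical): four instances with identical labels.
gts = [(0, 0, 10, 10)] * 4
preds = [(0, 0, 10, 10), (0, 0, 10, 8), (5, 5, 15, 15), (0, 0, 10, 10)]

# Suppose manual inspection flagged instance 2 as mislabeled.
noisy = {2}

acc_all = rec_accuracy(preds, gts)
acc_clean = rec_accuracy(preds, gts, exclude=noisy)
print(f"all instances: {acc_all:.2f}, noisy excluded: {acc_clean:.2f}")
```

In this toy case the only miss falls on the flagged instance, so accuracy rises once it is excluded. On a real benchmark the size of the change depends on how the model's errors overlap with the mislabeled instances.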

Code & Models

Repositories

jierunchen/ref-l4
PyTorch · Official

Models

🤗 linhuixiao/Awesome-Visual-Grounding
model · ♡ 1

Taxonomy

Topics: Speech and dialogue systems · Language, Metaphor, and Cognition · Interpreting and Communication in Healthcare