LocateBench: Evaluating the Locating Ability of Vision Language Models
Ting-Rui Chiang, Joshua Robinson, Xinyan Velocity Yu, Dani Yogatama

TL;DR
LocateBench is a new benchmark designed to evaluate how well vision language models can locate objects in images based on natural language instructions, revealing that even the best models still underperform humans.
Contribution
This work introduces LocateBench, a dedicated benchmark for assessing object locating ability in vision language models, and provides a comprehensive evaluation of multiple models and prompting methods.
Findings
GPT-4o, the strongest model tested, trails human accuracy by more than 10%
Prompting approaches significantly affect model performance
LocateBench offers a high-quality standard for future evaluations
Abstract
The ability to locate an object in an image according to natural language instructions is crucial for many real-world applications. In this work we propose LocateBench, a high-quality benchmark dedicated to evaluating this ability. We experiment with multiple prompting approaches, and measure the accuracy of several large vision language models. We find that even the accuracy of the strongest model, GPT-4o, lags behind human accuracy by more than 10%.
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Semantic Web and Ontologies
