LocateBench: Evaluating the Locating Ability of Vision Language Models

Ting-Rui Chiang; Joshua Robinson; Xinyan Velocity Yu; Dani Yogatama

arXiv:2410.19808 · cs.CV · October 29, 2024

Open Access

TL;DR

LocateBench is a new benchmark designed to evaluate how well vision language models can locate objects in images based on natural language instructions, revealing that even the best models still underperform humans.

Contribution

This work introduces LocateBench, a dedicated benchmark for assessing object locating ability in vision language models, and provides a comprehensive evaluation of multiple models and prompting methods.

Findings

01 GPT-4o achieves over 10% lower accuracy than humans

02 Prompting approaches significantly affect model performance

03 LocateBench offers a high-quality standard for future evaluations

Abstract

The ability to locate an object in an image according to natural language instructions is crucial for many real-world applications. In this work we propose LocateBench, a high-quality benchmark dedicated to evaluating this ability. We experiment with multiple prompting approaches, and measure the accuracy of several large vision language models. We find that even the accuracy of the strongest model, GPT-4o, lags behind human accuracy by more than 10%.

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Semantic Web and Ontologies