Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

Fenfen Lin; Yesheng Liu; Haiyu Xu; Chen Yue; Zheqi He; Mingxuan Zhao; Miguel Hu Chen; Jiakang Liu; JG Yao; Xi Yang

arXiv:2510.26865·cs.CV·March 25, 2026

TL;DR

This paper introduces MeasureBench, a comprehensive benchmark for evaluating vision-language models on visual measurement reading tasks, revealing current models' limitations and exploring reinforcement finetuning improvements.

Contribution

The work provides a new benchmark with a scalable data synthesis pipeline and demonstrates the challenges and potential improvements for VLMs in precise spatial and numeracy tasks.

Findings

01

Even the strongest frontier VLMs, proprietary and open-weight alike, struggle with measurement reading in general.

02

Reinforcement finetuning improves performance on synthetic and real images.

03

VLMs show fundamental limitations in fine-grained spatial grounding.

Abstract

Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs) as we find in preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle with measurement reading in general. We have also conducted preliminary experiments with reinforcement finetuning (RFT) over synthetic data, and…
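
To make the idea of procedural gauge synthesis concrete, here is a minimal sketch of how a dial gauge specification could be sampled and how its ground-truth reading relates to the pointer geometry. It is an illustration under stated assumptions, not the authors' pipeline: the names `GaugeSpec`, `sample_gauge`, `pointer_angle`, `pointer_tip`, and `reading_from_angle`, the parameter ranges, and the linear clockwise scale are all hypothetical choices.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class GaugeSpec:
    """Hypothetical description of one synthetic dial gauge."""
    min_value: float        # value printed at the first tick
    max_value: float        # value printed at the last tick
    start_angle_deg: float  # angle of the first tick, image coordinates (y down)
    sweep_deg: float        # clockwise angular extent of the scale
    major_ticks: int        # number of labeled intervals
    value: float            # ground-truth reading shown by the pointer


def sample_gauge(rng: random.Random) -> GaugeSpec:
    """Draw a random gauge; the ranges are illustrative assumptions."""
    min_value = rng.choice([0.0, -20.0, 100.0])
    max_value = min_value + rng.choice([10.0, 100.0, 250.0])
    return GaugeSpec(
        min_value=min_value,
        max_value=max_value,
        start_angle_deg=rng.choice([135.0, 150.0, 180.0]),
        sweep_deg=rng.choice([180.0, 240.0, 270.0]),
        major_ticks=rng.choice([5, 10]),
        value=round(rng.uniform(min_value, max_value), 1),
    )


def pointer_angle(spec: GaugeSpec) -> float:
    """Map the reading linearly onto the scale's angular range."""
    frac = (spec.value - spec.min_value) / (spec.max_value - spec.min_value)
    return spec.start_angle_deg + frac * spec.sweep_deg


def pointer_tip(spec: GaugeSpec, cx: float, cy: float, length: float):
    """Pixel position of the pointer tip for a dial centered at (cx, cy)."""
    theta = math.radians(pointer_angle(spec))
    return cx + length * math.cos(theta), cy + length * math.sin(theta)


def reading_from_angle(spec: GaugeSpec, angle_deg: float) -> float:
    """Inverse mapping: recover the reading from a pointer angle."""
    frac = (angle_deg - spec.start_angle_deg) / spec.sweep_deg
    return spec.min_value + frac * (spec.max_value - spec.min_value)
```

Because the label is fixed before rendering, a generator of this kind can vary appearance (fonts, lighting, clutter) freely without re-annotating. The inverse mapping also shows why the task depends on fine spatial precision: on a 0 to 100 scale spanning 270 degrees, an angular error of 2.7 degrees already shifts the reading by one unit.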

Code & Models

Datasets

FlagEval/MeasureBench
dataset · 294 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Data Visualization and Analytics