PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models

Hisayuki Yokomizo; Taiki Miyanishi; Yan Gang; Shuhei Kurita; Nakamasa Inoue; Yusuke Iwasawa

arXiv:2603.16958 · cs.CV · March 19, 2026

Open Access

TL;DR

This paper introduces PhysQuantAgent, a framework leveraging vision-language models and a new benchmark dataset to improve real-world object mass estimation for robotic manipulation, demonstrating significant accuracy gains through visual prompting methods.

Contribution

The work presents a novel framework and benchmark for physical property estimation using VLMs, incorporating visual prompting techniques to enhance mass inference accuracy in real-world scenarios.

Findings

01. Visual prompting improves mass estimation accuracy

02. The benchmark dataset enables realistic evaluation of physical reasoning

03. Spatial reasoning enhances VLM capabilities for physical inference

Abstract

Vision-Language Models (VLMs) are increasingly applied to robotic perception and manipulation, yet their ability to infer physical properties required for manipulation remains limited. In particular, estimating the mass of real-world objects is essential for determining appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, and most existing benchmarks do not explicitly evaluate physical quantity estimation under realistic sensing conditions. In this work, we propose PhysQuantAgent, a framework for real-world object mass estimation using VLMs, together with VisPhysQuant, a new benchmark dataset for evaluation. VisPhysQuant consists of RGB-D videos of real objects captured from multiple viewpoints, annotated with precise mass measurements. To improve estimation accuracy, we introduce three visual prompting methods that…

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications