Physically Guided Visual Mass Estimation from a Single RGB Image

Sungjae Lee; Junhan Jeong; Yeonjoo Hong; Kwang In Kim

arXiv:2601.20303 · cs.CV · May 6, 2026

TL;DR

This paper introduces a physically structured framework that estimates object mass from a single RGB image by combining geometry, semantic, and appearance cues. It outperforms state-of-the-art methods on the image2mass and ABO-500 datasets.

Contribution

The approach aligns visual cues with the physical factors that govern mass: monocular depth estimation recovers object-centric 3D geometry to inform volume, and a vision-language model extracts coarse material semantics to guide density-related reasoning.

Findings

01. Outperforms state-of-the-art methods on the image2mass and ABO-500 datasets.

02. Effectively combines geometry, semantics, and appearance for mass prediction.

03. Uses physically guided latent factors with mass-only supervision.
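
A minimal sketch of what mass-only supervision over latent factors could look like, assuming the two factors are a log-volume and a log-density that sum to log-mass (because mass = volume × density). The function names, the squared-error loss in log space, and the example values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def predict_log_mass(log_volume, log_density):
    """mass = volume * density, so log(mass) = log(volume) + log(density)."""
    return log_volume + log_density

def mass_only_loss(log_volume, log_density, true_mass):
    """Mean squared error on log-mass; the factors themselves get no labels."""
    pred = predict_log_mass(log_volume, log_density)
    return float(np.mean((pred - np.log(true_mass)) ** 2))

# Hypothetical values: two objects, volume in m^3, density in kg/m^3, mass in kg.
log_v = np.log(np.array([0.001, 0.002]))
log_rho = np.log(np.array([1000.0, 500.0]))
mass = np.array([1.0, 1.0])

loss = mass_only_loss(log_v, log_rho, mass)

# Scale volume up and density down by the same factor: the product is unchanged.
loss_shifted = mass_only_loss(log_v + 1.0, log_rho - 1.0, mass)
```

The last line illustrates that rescaling volume by a constant and density by its inverse leaves the loss unchanged, so mass labels alone do not pin down the individual factors. Some additional guidance on the factors is needed for them to stay physically meaningful.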

Abstract

Estimating object mass from visual input is challenging because mass depends jointly on geometric volume and material-dependent density, neither of which is directly observable from RGB appearance. Consequently, mass prediction from pixels is ill-posed and therefore benefits from physically meaningful representations to constrain the space of plausible solutions. We propose a physically structured framework for single-image mass estimation that addresses this ambiguity by aligning visual cues with the physical factors governing mass. From a single RGB image, we recover object-centric three-dimensional geometry via monocular depth estimation to inform volume and extract coarse material semantics using a vision-language model to guide density-related reasoning. These geometry, semantic, and appearance representations are fused through an instance-adaptive gating mechanism, and two…
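
One way an instance-adaptive gate over the three representations might be sketched, assuming each cue is already encoded as a fixed-length feature vector. The softmax gate, the single linear layer (`W`, `b`), the feature dimension, and the function name `gated_fusion` are all hypothetical choices for illustration; the abstract does not specify the gate's form.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(geom, sem, app, W, b):
    """Fuse three (batch, d) feature sets with per-instance gate weights.

    W has shape (3 * d, 3) and b has shape (3,): one gate logit per cue.
    """
    feats = np.stack([geom, sem, app], axis=1)        # (batch, 3, d)
    concat = feats.reshape(feats.shape[0], -1)        # (batch, 3 * d)
    gates = softmax(concat @ W + b)                   # (batch, 3), rows sum to 1
    fused = (gates[:, :, None] * feats).sum(axis=1)   # (batch, d)
    return fused, gates

# Random stand-ins for geometry, semantic, and appearance features.
rng = np.random.default_rng(0)
batch, d = 4, 8
geom, sem, app = (rng.normal(size=(batch, d)) for _ in range(3))
W = rng.normal(scale=0.1, size=(3 * d, 3))
b = np.zeros(3)

fused, gates = gated_fusion(geom, sem, app, W, b)
```

Because the gate weights are computed from each instance's own features, the fusion can in principle lean on a different cue for each object rather than using one fixed mixing ratio for the whole dataset.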
