TL;DR
VoMP is a fast, learned method that predicts detailed volumetric mechanical properties of 3D objects using a geometry transformer and physically plausible material latents, improving accuracy over previous approaches.
Contribution
VoMP introduces a feed-forward approach that pairs a Geometry Transformer with a learned manifold of physically plausible materials to predict volumetric mechanical properties.
Findings
VoMP outperforms prior methods in accuracy.
VoMP runs as a feed-forward pass in seconds, significantly faster than prior optimization-based techniques.
The method predicts properties for any 3D representation that can be rendered and voxelized.
Abstract
Physical simulation relies on spatially-varying mechanical properties, often laboriously hand-crafted. VoMP is a feed-forward method trained to predict Young's modulus ($E$), Poisson's ratio ($\nu$), and density ($\rho$) throughout the volume of 3D objects, in any representation that can be rendered and voxelized. VoMP aggregates per-voxel multi-view features and passes them to our trained Geometry Transformer to predict per-voxel material latent codes. These latents reside on a manifold of physically plausible materials, which we learn from a real-world dataset, guaranteeing the validity of decoded per-voxel materials. To obtain object-level training data, we propose an annotation pipeline combining knowledge from segmented 3D datasets, material databases, and a vision-language model, along with a new benchmark. Experiments show that VoMP estimates accurate volumetric properties, far…
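To make the three predicted quantities concrete, the sketch below checks per-voxel values against the standard validity ranges for isotropic linear elasticity and converts $E$ and $\nu$ to Lamé parameters, the form many FEM solvers consume. This is a textbook conversion and not a step the abstract attributes to VoMP; the function names and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np

def is_physically_valid(E, nu, rho):
    """Element-wise check: E > 0, -1 < nu < 0.5, rho > 0 (isotropic linear elasticity)."""
    E, nu, rho = (np.asarray(a, dtype=float) for a in (E, nu, rho))
    return (E > 0) & (nu > -1.0) & (nu < 0.5) & (rho > 0)

def lame_parameters(E, nu):
    """Convert Young's modulus and Poisson's ratio to Lamé parameters (lambda, mu)."""
    E = np.asarray(E, dtype=float)
    nu = np.asarray(nu, dtype=float)
    lam = E * nu / ((1.0 + nu) * (1.0 - 2.0 * nu))
    mu = E / (2.0 * (1.0 + nu))
    return lam, mu

# Hypothetical per-voxel values (illustrative only, not results from the paper).
E = np.array([2.0e9, 1.0e6])      # Young's modulus in Pa
nu = np.array([0.35, 0.45])       # Poisson's ratio (dimensionless)
rho = np.array([1200.0, 1100.0])  # density in kg/m^3

valid = is_physically_valid(E, nu, rho)
lam, mu = lame_parameters(E, nu)
```

The conversion diverges as $\nu$ approaches 0.5 (incompressibility), which is one reason a guarantee that decoded materials stay within a valid range matters for downstream simulation.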
Peer Reviews
Decision: ICLR 2026 Poster
1. New Dataset: The GVM pipeline combines part segmentation + PBR textures + VLM prompts constrained by a curated materials range table to annotate 37M voxels, a scale jump over earlier works that only annotated sparse points. This will be useful to the community and encourage future research in this under-explored area.
2. Volumetric Representation: Unlike previous models that focus heavily on surface properties or struggle with interior prediction (e.g., NeRF2Physics and PUGS due to feature f
I do not have a major concern, only several minor points:
1. Reliance on Approximate Input and Voxelization Resolution: The final output resolution is limited by the fixed-grid voxelization. This can lead to oversmoothing in highly heterogeneous regions and approximation errors when mapping results back to highly detailed input geometry, especially for thin structures or internal complexities. Also, the authors do not provide a comprehensive study of the impact of input and output resolution.
2. Assumpt
- Representation-agnostic, feed-forward pipeline that predicts per-voxel mechanical properties in seconds, with outputs directly usable in simulators.
- Data contribution: a VLM-based pipeline that annotates ~1.6K part-segmented 3D shapes with **volumetric** materials; unlike PIXIE’s Pixelverse (surface-biased), this provides volumetric supervision.
- Extensive quantitative and qualitative results with detailed ablations, plus realistic FEM simulations.
- Clear writing and thorough explanations.
- Scalability is limited by the size and diversity of the training set (≈1.3K shapes) and the need for part-segmented annotations, making it harder to scale than methods that avoid volumetric labeling.
- Most experiments use author-curated data; because the test set follows a similar distribution, generalization to independent datasets (e.g., Objaverse) is uncertain.
- Resolution is bounded by fixed-grid voxelization, which restricts fine-detail fidelity.
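The resolution concern raised in these reviews can be illustrated with a minimal sketch of how a per-voxel field might be queried at arbitrary points on the input geometry. It assumes a uniform grid over an axis-aligned bounding box and nearest-voxel lookup; the helper `sample_voxel_field` and the synthetic grid are hypothetical, and the paper's actual mapping back to the input geometry may differ.

```python
import numpy as np

def sample_voxel_field(field, points, bbox_min, bbox_max):
    """Nearest-voxel lookup of a per-voxel scalar field at arbitrary 3D points."""
    field = np.asarray(field)
    res = np.array(field.shape)
    bbox_min = np.asarray(bbox_min, dtype=float)
    bbox_max = np.asarray(bbox_max, dtype=float)
    # Normalize points to the unit cube, then scale to integer grid indices.
    unit = (np.asarray(points, dtype=float) - bbox_min) / (bbox_max - bbox_min)
    idx = np.clip(np.floor(unit * res).astype(int), 0, res - 1)
    return field[idx[:, 0], idx[:, 1], idx[:, 2]]

# Synthetic 64^3 field of Young's modulus: soft half and stiff half (illustrative values).
grid = np.zeros((64, 64, 64))
grid[:32] = 1.0e6
grid[32:] = 2.0e9

voxel_size = 1.0 / 64  # edge length of one voxel in a unit-cube bounding box

# The first two points are 0.005 apart, less than one voxel, so they share a voxel.
points = np.array([[0.100, 0.5, 0.5], [0.105, 0.5, 0.5], [0.900, 0.5, 0.5]])
values = sample_voxel_field(grid, points, [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

Any two query points that fall inside the same voxel receive identical properties, so features thinner than `voxel_size` cannot carry their own material values at this resolution.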
1. The core design is novel and effective. The problem is decoupled into two parts, which guarantees valid physical latent space learning and fast, plausible prediction. The ablation study confirms the superiority of the proposed architecture.
2. The feed-forward nature achieves inference in seconds compared to the previous optimization approach, making it a useful tool for setting up scalable simulation pipelines.
3. The paper includes a comprehensive set of experiments, including strong quantitative comp
1. One remaining concern is the reliance on VLM-generated ground truth. An expert-annotated dataset is hard to obtain; hence, the use of VLMs and external knowledge is an interesting approach to data construction. But how to guarantee the correctness of the VLM-predicted values is also crucial, especially for real downstream use in the future.
2. The Geometry Transformer is trained using a fixed $64^3$ voxel grid, which is rather low resolution for complex 3D assets. This seems like a major bottlen
