Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels
Long Le, Ryan Lucas, Chen Wang, Chuhao Chen, Dinesh Jayaraman, Eric Eaton, Lingjie Liu

TL;DR
PIXIE is a fast, generalizable neural network that predicts the physical properties of 3D scenes from visual data, enabling realistic physics simulation and zero-shot generalization to real-world scenes.
Contribution
The paper presents PIXIE, a supervised learning approach that predicts 3D scene physics from images. It is trained on a large dataset and supports fast inference and zero-shot generalization.
Findings
PIXIE outperforms test-time optimization methods by 1.46-4.39x in accuracy.
PIXIE is orders of magnitude faster than existing methods.
Using pretrained features like CLIP enables zero-shot generalization to real scenes.
Abstract
Inferring the physical properties of 3D scenes from visual information is a critical yet challenging task for creating interactive and realistic virtual worlds. While humans intuitively grasp material characteristics such as elasticity or stiffness, existing methods often rely on slow, per-scene optimization, limiting their generalizability and application. To address this problem, we introduce PIXIE, a novel method that trains a generalizable neural network to predict physical properties across multiple scenes from 3D visual features purely using supervised losses. Once trained, our feed-forward network can perform fast inference of plausible material fields, which, coupled with a learned static scene representation like Gaussian Splatting, enables realistic physics simulation under external forces. To facilitate this research, we also collected PIXIEVERSE, one of the largest known…
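The abstract states only that the network is trained "purely using supervised losses", and one review below describes its outputs as discrete material classes plus continuous parameters (Young's modulus, Poisson's ratio, density). A minimal sketch of what such a per-point loss could look like follows. The function names, the cross-entropy plus mean-squared-error combination, the `weight` term, and the example parameter encoding are all assumptions made for illustration, not the authors' formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_class):
    """Negative log-likelihood of the labelled material class."""
    return -math.log(softmax(logits)[target_class])

def mse(pred, target):
    """Mean squared error over the continuous parameter vector."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def supervised_material_loss(class_logits, class_label,
                             param_pred, param_target, weight=1.0):
    """Hypothetical per-point loss: classification term plus weighted regression term."""
    return (cross_entropy(class_logits, class_label)
            + weight * mse(param_pred, param_target))

# Illustrative call; the 3-vector stands in for (Young's modulus, Poisson's ratio,
# density) under some assumed normalization that the source does not specify.
loss = supervised_material_loss(
    class_logits=[2.0, 0.5, -1.0],
    class_label=0,
    param_pred=[0.4, 0.30, 0.7],
    param_target=[0.5, 0.25, 0.7],
)
```

In a sketch like this the loss would be averaged over all labelled 3D points in a scene; how the actual method weights or normalizes each term is not stated in the text above.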
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper is well-written, with a very clear flow of the story. The figures, as well as the supplementary website, are aesthetically pleasing and convey the results pretty well.
- This paper is one of the first works that attempt to perform physical parameter prediction of 3D objects in a feed-forward manner, and this indeed leads to a reduced time for test-time optimization.
- The qualitative results and renderings look very nice.
- The collected dataset only contains 10 categories of objects. This is partially because of the highly engineered design of their data annotation framework - it seems to be very complex and ad-hoc, and I do not believe that it can be really scaled up to model a more diverse range of object categories, not to mention that there are a lot of objects that simply cannot be categorized into some categories.
- There is no evaluation on the reliability of the data curation process. How reliable is the
1. Learning physics is meaningful and important for visual understanding and embodied AIs.
2. The proposed method does not require test-time optimization, making it efficient in parameter estimation.
3. The paper is well-organized and very easy to follow.
Despite the strengths above, this paper has the following weaknesses:
1. The motivation is problematic. The material parameters are purely predicted based on static semantics. Although the paper claims this as an advantage, this is a significant weakness in my opinion. Firstly, even for the same material, parameters can vary widely. For example, rubber can be either hard or soft. Secondly, material properties and mass distribution are interdependent, as demonstrated in [1]. This means that
- The paper introduces PIXIEVERSE, an open-source dataset of 1,624 3D assets annotated with physical material parameters, enabling future research.
- The paper proposes the first supervised learning method that directly predicts both discrete material classes and continuous physical parameters (Young’s modulus, Poisson’s ratio, density) from 3D visual features, which enables faster inference than prior test-time optimization methods.
- Extensive experiments on synthetic data and real-world data
- The PIXIEVERSE dataset relies heavily on semi-automatic annotations generated by vision-language models. Such labels may contain systematic biases or noise, and their accuracy is not quantitatively validated.
- While Figure 4 reports a 2-second inference time, this does not account for the required NeRF/feature-field reconstruction step, which can be computationally expensive.
- Although the paper claims zero-shot generalization on Spring-Gaus data, it omits comparisons against Spring-Gaus, in
