Capturing Visual Environment Structure Correlates with Control Performance
Jiahua Dong, Yunze Man, Pavel Tokmakov, Yu-Xiong Wang

TL;DR
This paper demonstrates that probing pretrained visual encoders for environment state decoding correlates strongly with control policy performance, offering a new efficient metric for selecting visual representations in robotic manipulation tasks.
Contribution
It introduces a novel probing method to evaluate visual encoders based on their support for decoding environment state, improving representation selection for control.
Findings
Probing accuracy correlates with downstream policy success.
The method outperforms existing proxy metrics.
Encoding physical environment state enhances control generalization.
Abstract
The choice of visual representation is key to scaling generalist robot policies. However, direct evaluation via policy rollouts is expensive, even in simulation. Existing proxy metrics focus on the representation's capacity to capture narrow aspects of the visual world, like object shape, limiting generalization across environments. In this paper, we take an analytical perspective: we probe pretrained visual encoders by measuring how well they support decoding of environment state -- including geometry, object structure, and physical attributes -- from images. Leveraging simulation environments with access to ground-truth state, we show that this probing accuracy strongly correlates with downstream policy performance across diverse environments and learning settings, significantly outperforming prior metrics and enabling efficient representation selection. More broadly, our study…
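The probing idea in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the "encoders" are synthetic stand-ins (a fixed random linear mixing of the state plus noise), the probe is a ridge-regularized linear regressor, the score is held-out R^2, the policy success rates are made-up numbers, and Spearman rank correlation is one plausible way to compare the two rankings. All function names are hypothetical.

```python
import numpy as np

def add_bias(feats):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([feats, np.ones((feats.shape[0], 1))])

def fit_linear_probe(feats, states, l2=1e-3):
    """Closed-form ridge regression from frozen features to state targets."""
    X = add_bias(feats)
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ states)

def probe_r2(W, feats, states):
    """Held-out R^2 of the probe, pooled over all state dimensions."""
    pred = add_bias(feats) @ W
    ss_res = ((states - pred) ** 2).sum()
    ss_tot = ((states - states.mean(axis=0)) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

def spearman(a, b):
    """Spearman rank correlation (no tie handling; fine for distinct values)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

rng = np.random.default_rng(0)

# Synthetic ground-truth environment state (e.g. poses, attributes): ASSUMPTION.
states = rng.normal(size=(400, 5))

# Stand-in "encoders": same linear mixing of the state, increasing feature noise.
mix = rng.normal(size=(5, 16))
noise_levels = [0.1, 0.5, 1.5, 4.0, 10.0]

# Made-up policy success rates, one per stand-in encoder: ASSUMPTION.
policy_success = np.array([0.90, 0.75, 0.60, 0.40, 0.20])

probe_scores = []
for noise in noise_levels:
    feats = states @ mix + noise * rng.normal(size=(400, 16))
    W = fit_linear_probe(feats[:300], states[:300])                 # train split
    probe_scores.append(probe_r2(W, feats[300:], states[300:]))     # test split
probe_scores = np.array(probe_scores)

# How well does the cheap probe ranking agree with the expensive rollout ranking?
rho = spearman(probe_scores, policy_success)
print("probe R^2 per encoder:", np.round(probe_scores, 3))
print("Spearman rho vs. policy success:", round(rho, 3))
```

In a real study the features would come from frozen pretrained encoders applied to simulator images, and the success rates from policy rollouts; the probe is cheap because it needs no rollouts at all.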
Peer Reviews
Decision: ICLR 2026 Poster
* The method is simple and efficient. The predicted state covers information about object- and scene-level variables, ultimately yielding a single score for predicting policy performance.
* The paper evaluates across a breadth of environments, including 3 simulation environments and real-world evaluation on two tasks.
* The paper uses a strong set of baselines, including few-shot, action MSE, Depth, etc.
1) The method relies on privileged information from the simulator (state + 2D object boxes), which cannot be made available in the real world. It is not clear to me from the current set of experiments whether the rankings would correlate if the real-world environment is substantially different from the simulated environment where the data is collected.
2) As tasks get more complicated, the number of state variables to track will keep on increasing. For example, if the task requires picking up objects of di
- Strong empirical validation: Correlation holds across MetaWorld, RoboCasa, SimplerEnv with multiple seeds and error bars.
- Computational efficiency: Good speedups over policy rollouts; actionable for everyday / larger-scale benchmarking.
- Unified evaluation setup: Works across environments with a consistent state target.
- Actionable insights: Per-dimension/attribute analyses give interpretability; sim-to-real correlation is encouraging.
- Clarity: Problem framing and metrics are easy to
- Task diversity and modality coverage: While the results across manipulation benchmarks are compelling, the current evaluation is limited to manipulation-centric settings. Prior work (e.g., VC-1) has shown a form of multi-modality, where visual representations that excel in certain domains (e.g., R3M on MetaWorld) can perform poorly in others (e.g., navigation tasks in Habitat). This raises the question of whether the proposed proxy generalizes across task families with different perceptual and
## Originality
The proposed metric is, to my knowledge, novel and useful. It can also be a useful proxy when designing new visual encoders, as well as for quality control of the final result (and intermediate checkpoints during training), and potentially as an additional auxiliary loss during training.
## Quality
The authors cast a wide net, and systematically explore the effects of many different visual representations in many robotics tasks. They empirically demonstrate that their proxy is both
In general, I would expect that L2 losses on poses would have problems with wrapping, causing large errors. I didn't notice where in the paper they tackled this potential problem. E.g., a scene where the objects are aligned in a specific manner (lying on a table?) could have problems where close poses end up with very different values per axis of the pose, throwing off the pose in a set of tasks.
The representation you can get out of a simulator will be a bit limited - it is unlikely to
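The wrapping concern raised in this review can be made concrete with a small numeric sketch. It assumes a single rotation angle in radians and compares a naive squared error with two common wrap-aware alternatives; it illustrates the reviewer's point only and says nothing about how the paper actually parameterizes or scores poses. The function names are hypothetical.

```python
import numpy as np

def naive_l2(pred, target):
    """Squared error on raw angles; blows up across the +/-pi boundary."""
    return float((pred - target) ** 2)

def wrapped_l2(pred, target):
    """Squared error on the shortest signed angular difference."""
    diff = (pred - target + np.pi) % (2 * np.pi) - np.pi
    return float(diff ** 2)

def sincos_l2(pred, target):
    """Squared error after embedding each angle as (sin, cos)."""
    p = np.array([np.sin(pred), np.cos(pred)])
    t = np.array([np.sin(target), np.cos(target)])
    return float(((p - t) ** 2).sum())

# Two orientations that are physically only 2 degrees apart.
a = np.deg2rad(179.0)
b = np.deg2rad(-179.0)

naive = naive_l2(a, b)      # treats the poses as 358 degrees apart
wrapped = wrapped_l2(a, b)  # treats them as 2 degrees apart
embedded = sincos_l2(a, b)  # small, and continuous across the boundary

print(naive, wrapped, embedded)
```

The naive loss is several orders of magnitude larger than the other two for the same physical discrepancy, which is the kind of distortion the reviewer expects to affect pose targets near the wrap-around point.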
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Reinforcement Learning in Robotics · Robot Manipulation and Learning
