Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning

Junlin Wang; Zhiyun Lin

arXiv:2505.18487·cs.RO·February 17, 2026

TL;DR

This paper introduces ICon, a contrastive learning method for Vision Transformers that enhances robotic manipulation by embedding body-specific cues into visual representations, leading to improved policy learning and transfer.

Contribution

We propose ICon, a novel contrastive learning approach that separates agent-specific and environment-specific features in ViTs, improving manipulation policy efficiency and transferability.

Findings

01 Enhanced manipulation policy performance across tasks

02 Facilitated transfer of policies between robots

03 Improved agent-centric visual representations

Abstract

Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present Inter-token Contrast (ICon), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, resulting in agent-centric visual representations that embed body-specific inductive biases. This framework can be seamlessly integrated into end-to-end policy learning by incorporating the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy…
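The abstract describes the idea only at a high level, and this excerpt does not give ICon's exact loss. As a rough illustration of what "separating agent-specific and environment-specific tokens" and "adding the contrastive loss as an auxiliary objective" could look like, here is a minimal NumPy sketch. It uses a generic InfoNCE-style objective with a mean agent prototype as the anchor; the function names, the prototype choice, `temperature`, and `weight` are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def inter_token_contrast(tokens, agent_mask, temperature=0.1):
    """InfoNCE-style loss over ViT tokens (illustrative, not the paper's loss).

    tokens:     (N, D) array of token features.
    agent_mask: (N,) boolean array, True where a token is agent-specific.
    Each agent token is pulled toward the mean agent prototype, with all
    environment tokens serving as negatives for that prototype.
    """
    tokens = np.asarray(tokens, dtype=float)
    agent_mask = np.asarray(agent_mask, dtype=bool)
    z = l2_normalize(tokens)
    agent, env = z[agent_mask], z[~agent_mask]
    if len(agent) == 0 or len(env) == 0:
        return 0.0  # nothing to contrast in this frame

    anchor = l2_normalize(agent.mean(axis=0))   # (D,) agent prototype
    pos = agent @ anchor / temperature          # (Na,) positive logits
    neg = env @ anchor / temperature            # (Ne,) negative logits

    # One row per agent token: [positive, all negatives].
    logits = np.concatenate(
        [pos[:, None], np.broadcast_to(neg, (len(pos), len(neg)))], axis=1
    )
    # Numerically stable log-softmax of the positive column.
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits[:, 0] - (m[:, 0] + np.log(np.exp(logits - m).sum(axis=1)))
    return float(-log_prob.mean())


def total_loss(policy_loss, contrast_loss, weight=0.1):
    """Policy loss plus the weighted auxiliary contrastive term."""
    return policy_loss + weight * contrast_loss
```

In this sketch the loss approaches zero when agent and environment tokens point in opposite directions, and equals log(1 + Ne) when all tokens are identical, so minimizing it alongside the policy loss pushes the two token groups apart.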

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

1) The approach to disentangling agent-specific and environment-specific features is technically sound. FPS also helps obtain more useful representations than random sampling.
2) The improvement in success rates for a few tasks in Robosuite and RLBench seems to be consistent.
3) With little fine-tuning, the method seems to transfer easily across different robots, although it is not clear how much tuning is needed or how well the method would perform zero-shot.

Weaknesses

* The benchmarks used to validate the method are limited. Training the ViT concurrently with the policy also seems to limit the ability to do some pretraining. I believe it would be useful if the objective were used not only concurrently with the policy, but also in a large pretraining setup that could later be used out of the box for various downstream tasks without tuning.
* How much fine-tuning is needed? Also, would it work out of the box? I believe that at its current state since we train on o

Reviewer 02 · Rating 6 · Confidence 4

Strengths

* The utilization of FPS over 2D feature/token maps to ensure coverage is interesting.
* I appreciate the stability analysis.
* Nice video visualizations on the project website.
* Open-source data and code!
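To make the coverage point concrete, here is a minimal sketch of sampling token positions from a 2D token mask, assuming FPS here denotes farthest point sampling. The function names, the Euclidean distance on grid coordinates, and the fixed `start` index are illustrative assumptions, not details taken from the paper or its code.

```python
import numpy as np


def grid_coords(token_mask):
    """Return (row, col) coordinates of the True entries of a 2D token mask."""
    rows, cols = np.nonzero(np.asarray(token_mask, dtype=bool))
    return np.stack([rows, cols], axis=1)


def farthest_point_sampling(coords, k, start=0):
    """Greedily pick up to k indices into coords that spread out over the grid.

    Each step adds the point whose distance to the already chosen set
    is largest, which favors coverage over clustering.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    if n == 0 or k <= 0:
        return []
    k = min(k, n)
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    dist = np.linalg.norm(coords - coords[start], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(coords - coords[nxt], axis=1))
    return chosen
```

On a fully masked 4x4 grid with `k=4` and the first token as the start, this picks the four corners, whereas uniform random sampling could place all four samples in one region.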

Weaknesses

* The method relies on a pre-trained supervised segmentation model. While it is true that these models perform very well, they are still prone to wrong detections or missed objects on unseen data.
* The masking procedure relies on a heuristic (class thresholding), and it is unclear what would happen if part of the gripper and a small object (e.g., a small block) occupy the same tokens.
* I found the multi-level contrastive loss description in L246-251 confusing: I don’t understand how $\g
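The thresholding concern can be illustrated with a small sketch of turning a pixel-level agent mask into a token-level mask. This assumes a patch is labeled agent-specific when the fraction of agent pixels inside it reaches a threshold; the function name, `patch_size`, and `threshold` are hypothetical, and the paper's actual procedure may differ.

```python
import numpy as np


def pixel_mask_to_token_mask(pixel_mask, patch_size=16, threshold=0.5):
    """Downsample a binary pixel mask to a ViT patch grid by thresholding.

    Returns (token_mask, coverage), where coverage[i, j] is the fraction of
    agent pixels in patch (i, j) and token_mask = coverage >= threshold.
    """
    pixel_mask = np.asarray(pixel_mask, dtype=float)
    h, w = pixel_mask.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Split into (gh, patch, gw, patch) blocks and average within each block.
    patches = pixel_mask.reshape(gh, patch_size, gw, patch_size)
    coverage = patches.mean(axis=(1, 3))
    return coverage >= threshold, coverage
```

Under this assumed rule, a patch that is only partly covered by the gripper is assigned entirely to one class, which is the kind of token-level information loss the reviews point to.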

Reviewer 03 · Rating 2 · Confidence 4

Strengths

**Overview**
- Well-written paper.
- Method seems novel.
- Potentially applicable to any policy that employs a ViT image encoder.
- Evaluated on numerous environments.

I am willing to raise my score if the points raised in the Weaknesses and Questions sections are addressed.

Weaknesses

**Overview**
- The specific method is not well-motivated other than the agent-environment disentanglement.
- No comparison with other segmentation-based representation learning objectives.
- Information loss in produced token masks compared to the original pixel mask.
- Empirical performance gains are not significant.

**Method Motivation**
Why is your specific approach better than others for acquiring agent-environment disentanglement in feature space? This is neither discussed nor empirically

Code & Models

Repositories

henrywjl/icon
pytorch · Official

Datasets

HenryWJL/icon
dataset · 11 dl

Taxonomy

Methods · Contrastive Learning