Bridging Perception and Action: Spatially-Grounded Mid-Level Representations for Robot Generalization
Jonathan Yang, Chuyuan Kelly Fu, Dhruv Shah, Dorsa Sadigh, Fei Xia, Tingnan Zhang

TL;DR
This paper uses spatially grounded mid-level representations to improve robot policy learning and generalization in dexterous manipulation tasks, outperforming both a language-grounded baseline and a standard diffusion policy.
Contribution
It proposes a mixture-of-experts policy architecture that leverages interpretable mid-level representations for improved robot policy generalization and performance.
Findings
11% higher average success rate than a language-grounded baseline
24% higher average success rate than a standard diffusion policy
10% performance increase using supervised mid-level representations
Abstract
In this work, we investigate how spatially grounded auxiliary representations can provide both broad, high-level grounding as well as direct, actionable information to improve policy learning performance and generalization for dexterous tasks. We study these mid-level representations across three critical dimensions: object-centricity, pose-awareness, and depth-awareness. We use these interpretable mid-level representations to train specialist encoders via supervised learning, then feed them as inputs to a diffusion policy to solve dexterous bimanual manipulation tasks in the real world. We propose a novel mixture-of-experts policy architecture that combines multiple specialized expert models, each trained on a distinct mid-level representation, to improve policy generalization. This method achieves an average success rate that is 11% higher than a language-grounded baseline and 24…
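The abstract describes combining several experts, each tied to a distinct mid-level representation, before conditioning a diffusion policy. The truncated text does not say how the experts are combined, so the following is only a minimal sketch under stated assumptions: a generic softmax-gated mixture over three hypothetical expert encoders named after the three dimensions in the abstract. The function names, the gating scheme, the dimensions, and the random weights are all illustrative and are not taken from the paper.

```python
import numpy as np

# Hypothetical expert names, borrowed from the three dimensions in the abstract.
REPRESENTATIONS = ("object_centric", "pose_aware", "depth_aware")


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def init_params(rng, embed_dim, hidden_dim):
    """Random weights standing in for trained experts and a trained gate (assumption)."""
    n = len(REPRESENTATIONS)
    return {
        "experts": {
            name: rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
            for name in REPRESENTATIONS
        },
        "gate": rng.normal(scale=0.1, size=(n, n * embed_dim)),
    }


def fuse_experts(params, embeddings):
    """Blend per-representation expert features with softmax gate weights.

    embeddings maps each representation name to a 1-D feature vector that a
    specialist encoder would produce. The returned vector is what a downstream
    policy could be conditioned on in this sketch.
    """
    gate_input = np.concatenate([embeddings[name] for name in REPRESENTATIONS])
    weights = softmax(params["gate"] @ gate_input)
    expert_outputs = np.stack(
        [np.tanh(params["experts"][name] @ embeddings[name]) for name in REPRESENTATIONS]
    )
    fused = weights @ expert_outputs  # convex combination of expert features
    return fused, weights


rng = np.random.default_rng(0)
params = init_params(rng, embed_dim=8, hidden_dim=16)
embeddings = {name: rng.normal(size=8) for name in REPRESENTATIONS}
fused, weights = fuse_experts(params, embeddings)
```

In this sketch the gate sees all three embeddings at once, so the weight given to each expert can change from one observation to the next. Whether the paper's architecture gates this way, or fuses experts at the feature level at all, is not stated in the text above.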
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Multimodal Machine Learning Applications
