Bridging Perception and Action: Spatially-Grounded Mid-Level Representations for Robot Generalization

Jonathan Yang; Chuyuan Kelly Fu; Dhruv Shah; Dorsa Sadigh; Fei Xia; Tingnan Zhang

arXiv:2506.06196·cs.RO·June 9, 2025

Open Access

TL;DR

This paper uses spatially grounded mid-level representations (object-centric, pose-aware, and depth-aware) to improve robot policy learning and generalization in dexterous bimanual manipulation, outperforming both a language-grounded baseline and a standard diffusion policy.

Contribution

The paper proposes a mixture-of-experts policy architecture that combines specialist experts, each trained on a distinct interpretable mid-level representation, to improve robot policy generalization and performance.
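To make the idea of combining experts concrete, here is a minimal sketch of a generic soft mixture-of-experts step, in which per-expert feature vectors are blended by softmax gate weights into a single conditioning vector. The combination rule, the function names `softmax` and `mix_experts`, the expert labels, and all numbers are assumptions chosen for illustration; this page does not describe how the paper's architecture actually combines its experts.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(
    expert_features: Dict[str, List[float]],
    gate_scores: Dict[str, float],
) -> List[float]:
    """Blend per-expert feature vectors with softmax gate weights.

    Assumption: a plain weighted sum stands in for whatever
    combination rule the paper uses.
    """
    names = sorted(expert_features)
    weights = softmax([gate_scores[name] for name in names])
    dim = len(expert_features[names[0]])
    mixed = [0.0] * dim
    for name, weight in zip(names, weights):
        features = expert_features[name]
        if len(features) != dim:
            raise ValueError(f"expert {name!r} has a mismatched feature size")
        for i, value in enumerate(features):
            mixed[i] += weight * value
    return mixed


# Hypothetical outputs from three specialist encoders, labeled after the
# three dimensions named in the abstract. The values are made up.
features = {
    "object_centric": [1.0, 0.0],
    "pose_aware": [0.0, 1.0],
    "depth_aware": [1.0, 1.0],
}
# Equal gate scores give each expert the same weight.
gates = {"object_centric": 0.0, "pose_aware": 0.0, "depth_aware": 0.0}

conditioning = mix_experts(features, gates)
print(conditioning)
```

In a learned policy the gate scores would come from a small network conditioned on the observation, and the blended vector would be passed to the downstream policy as conditioning. Here the scores are fixed by hand so the sketch stays self-contained.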

Findings

01. 11% higher success rate than language-grounded baseline

02. 24% higher success rate than standard diffusion policy

03. 10% performance increase using supervised mid-level representations

Abstract

In this work, we investigate how spatially grounded auxiliary representations can provide both broad, high-level grounding and direct, actionable information to improve policy learning performance and generalization for dexterous tasks. We study these mid-level representations across three critical dimensions: object-centricity, pose-awareness, and depth-awareness. We use these interpretable mid-level representations to train specialist encoders via supervised learning, then feed them as inputs to a diffusion policy to solve dexterous bimanual manipulation tasks in the real world. We propose a novel mixture-of-experts policy architecture that combines multiple specialized expert models, each trained on a distinct mid-level representation, to improve policy generalization. This method achieves an average success rate that is 11% higher than a language-grounded baseline and 24…

Taxonomy

Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Multimodal Machine Learning Applications