PocketDP3: Efficient Pocket-Scale 3D Visuomotor Policy
Jinhao Zhang, Zhexuan Zhou, Huizhe Li, Yichen Lai, Wenlong Xia, Haoming Song, Youmin Gong, Jie Mei

TL;DR
PocketDP3 is a lightweight 3D diffusion policy built on an MLP-Mixer-based architecture. It reports state-of-the-art robotic manipulation performance with far fewer parameters and faster inference than prior methods, making it suitable for real-time applications.
Contribution
The paper proposes a compact 3D diffusion policy architecture that replaces the heavy conditional U-Net decoder of prior methods with a lightweight Diffusion Mixer (DiM), enabling efficient, real-time robotic manipulation.
Findings
Achieves state-of-the-art results on three benchmarks.
Uses fewer than 1% of the parameters of prior methods.
Supports two-step inference without performance loss and without additional consistency distillation.
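To make the two-step inference finding concrete, the sketch below shows what a two-step denoising loop can look like. The summary does not say which sampler PocketDP3 uses, so this assumes a deterministic DDIM-style update; the names `ddim_step`, `sample_two_step`, `oracle_eps_model` and the noise levels `abars` are all hypothetical and chosen only for illustration.

```python
import math
import random


def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM-style update (assumed sampler, not from the paper).

    abar_t and abar_prev are cumulative signal levels in (0, 1]; 1.0 means noise-free.
    """
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
               for x, e in zip(x_t, eps_pred)]
    # Re-noise that estimate to the next (less noisy) level.
    return [math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]


def sample_two_step(eps_model, dim, abars=(0.02, 0.5), seed=0):
    """Draw Gaussian noise, then call the noise predictor exactly twice."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    schedule = list(abars) + [1.0]  # finish at the noise-free level
    for abar_t, abar_prev in zip(schedule[:-1], schedule[1:]):
        x = ddim_step(x, eps_model(x, abar_t), abar_t, abar_prev)
    return x


def oracle_eps_model(target):
    """Hypothetical perfect noise predictor for a known target; demo only."""
    def eps_model(x_t, abar_t):
        return [(x - math.sqrt(abar_t) * t) / math.sqrt(1.0 - abar_t)
                for x, t in zip(x_t, target)]
    return eps_model
```

Calling `sample_two_step(oracle_eps_model([0.3, -0.7, 1.2]), dim=3)` recovers the target, because the oracle's noise prediction is exact. A learned policy would replace the oracle with a network conditioned on the observation, and its accuracy at only two steps is what the finding above is about.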
Abstract
Recently, 3D vision-based diffusion policies have shown strong capability in learning complex robotic manipulation skills. However, a common architectural mismatch exists in these models: a tiny yet efficient point-cloud encoder is often paired with a massive decoder. Given a compact scene representation, we argue that this may lead to substantial parameter waste in the decoder. Motivated by this observation, we propose PocketDP3, a pocket-scale 3D diffusion policy that replaces the heavy conditional U-Net decoder used in prior methods with a lightweight Diffusion Mixer (DiM) built on MLP-Mixer blocks. This architecture enables efficient fusion across temporal and channel dimensions, significantly reducing model size. Notably, without any additional consistency distillation techniques, our method supports two-step inference without sacrificing performance, improving practicality for…
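The abstract's "fusion across temporal and channel dimensions" can be illustrated with a generic MLP-Mixer block applied to an action sequence of shape (time steps, channels). This is a minimal dependency-free sketch of the standard Mixer pattern, not the authors' DiM: how DiM injects the diffusion timestep and point-cloud features is not described in this summary, so conditioning is omitted, and all names and sizes (`make_block`, `mixer_block`, `hidden`) are hypothetical.

```python
import math
import random


def layer_norm(row, eps=1e-5):
    """Normalise one vector to zero mean and unit variance (no learned scale)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def gelu(v):
    """Tanh approximation of the GELU activation."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def linear(row, weight):
    """Bias-free linear layer; weight holds one row of input weights per output."""
    return [sum(w * v for w, v in zip(w_row, row)) for w_row in weight]


def mlp(row, w_in, w_out):
    """Two-layer MLP: linear, GELU, linear."""
    return linear([gelu(v) for v in linear(row, w_in)], w_out)


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def init(n_out, n_in, rng):
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def make_block(steps, channels, hidden, rng):
    """Weights for one block: a temporal-mixing MLP and a channel-mixing MLP."""
    return {
        "t_in": init(hidden, steps, rng), "t_out": init(steps, hidden, rng),
        "c_in": init(hidden, channels, rng), "c_out": init(channels, hidden, rng),
    }


def mixer_block(x, params):
    """x is a list of `steps` rows, each holding `channels` values."""
    # Temporal mixing: for each channel, an MLP runs along the time axis.
    normed = [layer_norm(row) for row in x]
    mixed_t = transpose([mlp(col, params["t_in"], params["t_out"])
                         for col in transpose(normed)])
    x = [[a + b for a, b in zip(row, m)] for row, m in zip(x, mixed_t)]  # residual
    # Channel mixing: for each time step, an MLP runs along the channel axis.
    mixed_c = [mlp(layer_norm(row), params["c_in"], params["c_out"]) for row in x]
    return [[a + b for a, b in zip(row, m)] for row, m in zip(x, mixed_c)]  # residual


def count_params(params):
    return sum(len(w) * len(w[0]) for w in params.values())
```

The parameter count of such a block grows with `hidden * (steps + channels)` rather than with convolutional kernel and channel products, which gives some intuition for why a Mixer-style decoder can be much smaller than a U-Net. The actual sizes used in PocketDP3 are not given here.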
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis
