Robot-DIFT: Distilling Diffusion Features for Geometrically Consistent Visuomotor Control
Yu Deng, Yufeng Jin, Xiaogang Jia, Jiahong Xue, Gerhard Neumann, Georgia Chalvatzaki

TL;DR
Robot-DIFT leverages diffusion models to encode geometric features for robot manipulation, distilling this knowledge into a stable, real-time control network that outperforms traditional discriminative methods.
Contribution
The paper introduces Robot-DIFT, a novel framework that distills diffusion-based geometric features into a deterministic network for improved visuomotor control.
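As a rough illustration of what distilling features into a deterministic network can look like, the sketch below trains a small student to regress the features of a frozen, noisy teacher with a mean-squared-error loss. It is a generic feature-distillation toy and not the paper's architecture or training recipe. The random `W_teacher` matrix, the added noise, the layer sizes, and the names `teacher_features`, `student_features`, and `distill` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen teacher: a fixed random projection stands in for
# features from a pretrained generative model (an assumption, not the paper's).
W_teacher = rng.normal(scale=0.25, size=(16, 8))

def teacher_features(x, noise_std=0.1):
    """Teacher output with additive noise, mimicking a stochastic feature extractor."""
    clean = np.tanh(x @ W_teacher)
    return clean + rng.normal(scale=noise_std, size=clean.shape)

def student_features(x, W):
    """Deterministic student: the same input always yields the same features."""
    return np.tanh(x @ W)

def distill(x, steps=200, lr=0.5):
    """Fit the student to one set of teacher features by gradient descent on MSE."""
    W = rng.normal(scale=0.1, size=W_teacher.shape)
    target = teacher_features(x)  # teacher stays frozen; only W is updated
    losses = []
    for _ in range(steps):
        pred = student_features(x, W)
        err = pred - target
        losses.append(float(np.mean(err ** 2)))
        # Chain rule: d(MSE)/dW = x^T [ (2/n) * err * (1 - tanh^2) ]
        grad = x.T @ (err * (1.0 - pred ** 2)) * (2.0 / err.size)
        W -= lr * grad
    return W, losses

x = rng.normal(size=(256, 16))  # toy inputs standing in for image observations
W_student, losses = distill(x)
```

In this toy, `losses` should fall as the student approaches the teacher's noise-free mapping, and the trained student needs a single forward pass at inference time.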
Findings
Robot-DIFT achieves superior geometric consistency in manipulation tasks.
The framework demonstrates robustness against drift and real-time inference capability.
Pretrained on large-scale datasets, it outperforms existing discriminative baselines.
Abstract
We hypothesize that a key bottleneck in generalizable robot manipulation is not solely data scale or policy capacity, but a structural mismatch between current visual backbones and the physical requirements of closed-loop control. While state-of-the-art vision encoders (including those used in VLAs) optimize for semantic invariance to stabilize classification, manipulation typically demands geometric sensitivity: the ability to map millimeter-level pose shifts to predictable feature changes. Their discriminative objective creates a "blind spot" for fine-grained control, whereas generative diffusion models inherently encode geometric dependencies within their latent manifolds, encouraging the preservation of dense multi-scale spatial structure. However, directly deploying stochastic diffusion features for control is hindered by stochastic instability, inference latency, and representation…
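The contrast between semantic invariance and geometric sensitivity can be made concrete with a toy measurement. The sketch below shifts a synthetic blob by a few pixels (a stand-in for a small pose shift) and compares how much two hand-written features change: a globally pooled feature that discards position, and a coarse spatial grid that keeps it. Both features, the blob renderer, and every name and parameter here are assumptions chosen for illustration; none of them comes from the paper.

```python
import numpy as np

def render_blob(shift_px, size=32, sigma=2.0):
    """Render a Gaussian blob whose center is offset horizontally by shift_px pixels."""
    ys, xs = np.mgrid[0:size, 0:size]
    cx, cy = size / 2 + shift_px, size / 2
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def pooled_feature(img):
    """Invariance-style feature: global average pooling discards position."""
    return np.array([img.mean()])

def dense_feature(img, cell=4):
    """Geometry-preserving feature: average-pool into a coarse grid, keep the layout."""
    h, w = img.shape
    grid = img.reshape(h // cell, cell, w // cell, cell).mean(axis=(1, 3))
    return grid.ravel()

def sensitivity(feature_fn, shifts):
    """L2 distance between the feature at each shift and the feature at zero shift."""
    ref = feature_fn(render_blob(0.0))
    return np.array(
        [np.linalg.norm(feature_fn(render_blob(s)) - ref) for s in shifts]
    )

shifts = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
pooled_response = sensitivity(pooled_feature, shifts)  # stays near zero
dense_response = sensitivity(dense_feature, shifts)    # changes with the shift
```

In this setup the pooled feature barely registers the shift, while the grid feature changes measurably even for a sub-pixel offset, which is the kind of shift-to-feature mapping the abstract calls geometric sensitivity.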
Taxonomy
Topics: Robot Manipulation and Learning · Advanced Vision and Imaging · Reinforcement Learning in Robotics
