X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning

Maanping Shao; Feihong Zhang; Gu Zhang; Baiye Cheng; Zhengrong Xue; Huazhe Xu

arXiv:2601.11269 · cs.CV · January 19, 2026

TL;DR

X-Distill introduces a cross-architecture knowledge distillation method that transfers rich visual features from large Vision Transformers to compact CNNs, significantly improving data-efficient visuomotor learning in robotics.

Contribution

The paper presents an offline distillation approach that transfers the visual representations of a large, frozen DINOv2 ViT teacher to a compact ResNet-18 student, which is then fine-tuned jointly with a diffusion policy head for robotic manipulation.

Findings

01. Outperforms from-scratch ResNet policies on 34 benchmarks

02. Surpasses fine-tuned DINOv2 encoders in real-world tasks

03. Achieves state-of-the-art data efficiency in robotic manipulation

Abstract

Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on $34$ …
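
The distillation stage described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: tiny linear maps stand in for the frozen DINOv2 teacher and the ResNet-18 student, random vectors stand in for ImageNet images, and a mean-squared-error feature-matching loss is used because the abstract does not state the actual objective. The names `run_distillation` and `distill_step` are hypothetical.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def distill_step(W_student, x, target, lr):
    """One SGD step on the MSE between student output and teacher target.

    Updates W_student in place and returns the loss measured before the update.
    """
    pred = matvec(W_student, x)
    n = len(pred)
    for i, row in enumerate(W_student):
        # d(MSE)/d(pred_i) = 2 * (pred_i - target_i) / n
        grad_out = 2.0 * (pred[i] - target[i]) / n
        for j in range(len(row)):
            row[j] -= lr * grad_out * x[j]
    return mse(pred, target)


def run_distillation(steps=1000, lr=0.1, seed=0):
    """Train a toy student to reproduce a frozen toy teacher's features."""
    rng = random.Random(seed)
    # Assumed stand-in for the frozen teacher: fixed weights, never updated.
    W_teacher = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]]
    # Assumed stand-in for the student: starts from zeros.
    W_student = [[0.0] * 3 for _ in range(2)]
    losses = []
    for _ in range(steps):
        # Random input standing in for an image from the distillation set.
        x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        target = matvec(W_teacher, x)  # teacher features, no gradient needed
        losses.append(distill_step(W_student, x, target, lr))
    return W_teacher, W_student, losses


if __name__ == "__main__":
    _, _, losses = run_distillation()
    print("first loss:", losses[0])
    print("last loss:", losses[-1])
```

Only the student's weights change; the teacher only supplies targets. The subsequent joint fine-tuning with a diffusion policy head is not shown.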

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning