Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

Weitong Lian; Zecong Tang; Haoran Li; Tianjian Gao; Yifei Wang; Zixu Wang; Lingyi Meng; Tengju Ru; Zhejun Cui; Yichen Zhu; Hangshuo Cao; Qi Kang; Tianxing Chen; Yusen Qin; Kaixuan Wang; Yu Zhang

arXiv:2601.21288 · cs.AI · January 30, 2026

Open Access

TL;DR

Drive-KD introduces a multi-teacher knowledge distillation framework for autonomous driving VLMs, significantly reducing memory and latency while maintaining or improving performance across perception, reasoning, and planning tasks.

Contribution

The paper proposes a novel multi-teacher distillation approach with layer-specific attention and asymmetric gradient projection for autonomous driving VLMs, enhancing efficiency and performance.
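As a rough illustration of what a one-sided ("asymmetric") gradient projection can look like, here is a minimal sketch. It is not the paper's formulation, which this summary does not spell out. It assumes a PCGrad-style rule in which only the secondary capability's gradient is altered when it conflicts with the primary one. The function name `project_asymmetric` and the primary/secondary roles are hypothetical.

```python
import numpy as np

def project_asymmetric(g_primary, g_secondary, eps=1e-12):
    """Hypothetical one-sided projection (an assumption, not the paper's rule).

    If the two gradients conflict (negative dot product), remove from
    g_secondary its component along g_primary. g_primary is never modified.
    """
    dot = float(np.dot(g_primary, g_secondary))
    if dot >= 0.0:
        # No conflict: leave both gradients unchanged.
        return g_primary, g_secondary
    coeff = dot / (float(np.dot(g_primary, g_primary)) + eps)
    return g_primary, g_secondary - coeff * g_primary

# Toy example with two conflicting 2-D gradients.
g_p = np.array([1.0, 0.0])
g_s = np.array([-1.0, 1.0])
g_p_out, g_s_proj = project_asymmetric(g_p, g_s)
combined = g_p_out + g_s_proj  # update direction after resolving the conflict
```

In this toy case the projected secondary gradient no longer opposes the primary one, so the combined update keeps the full primary step. Which capability plays which role, and when the projection fires, are design choices the summary leaves open.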

Findings

01. Distilled InternVL3-1B outperforms larger models in accuracy and efficiency.

02. Method generalizes across diverse model architectures and scales.

03. Achieves higher throughput with substantially less GPU memory.

Abstract

Autonomous driving is an important and safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to bridge the capability gaps of small models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a "perception-reasoning-planning" triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Moreover, we unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient…
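To make the idea of using attention as a distillation signal concrete, the following is a minimal sketch under stated assumptions. It compares teacher and student attention maps at one chosen layer with a KL divergence, averaging over heads so that models with different head counts can be compared. The loss form, the head averaging, and the name `attention_kl` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax, used here only to build toy attention maps.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_kl(teacher_attn, student_attn, eps=1e-8):
    """Hypothetical attention-distillation loss (an assumption).

    Inputs have shape (heads, queries, keys) with rows summing to 1.
    Heads are averaged first, then KL(teacher || student) is computed
    per query row and averaged over queries.
    """
    t = teacher_attn.mean(axis=0)
    s = student_attn.mean(axis=0)
    # Renormalize each row so it stays a probability distribution.
    t = t / t.sum(axis=-1, keepdims=True)
    s = s / s.sum(axis=-1, keepdims=True)
    kl = np.sum(t * (np.log(t + eps) - np.log(s + eps)), axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(0)
teacher = softmax(rng.normal(size=(8, 4, 4)))  # toy teacher: 8 heads
student = softmax(rng.normal(size=(4, 4, 4)))  # toy student: 4 heads
loss_same = attention_kl(teacher, teacher)     # identical maps give ~0
loss_diff = attention_kl(teacher, student)     # mismatched maps give > 0
```

In practice such a term would be computed at the layers selected for each capability and added to the student's training loss. Which layers those are is the paper's contribution and is not reproduced here.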

Taxonomy

Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Reinforcement Learning in Robotics