Are Large Kernels Better Teachers than Transformers for ConvNets?
Tianjin Huang, Lu Yin, Zhenyu Zhang, Li Shen, Meng Fang, Mykola Pechenizkiy, Zhangyang Wang, Shiwei Liu

TL;DR
This study demonstrates that large-kernel ConvNets serve as highly effective teachers for small-kernel ConvNets in knowledge distillation, outperforming Transformers and achieving state-of-the-art results on ImageNet.
Contribution
This is the first study to show that large-kernel ConvNets are better teachers than Transformers for small-kernel ConvNets in knowledge distillation, improving student performance and transferring beneficial characteristics of the teacher.
Findings
Large-kernel ConvNets outperform Transformers as teachers in KD.
Distillation yields the best-ever pure ConvNet, with 83.1% top-1 accuracy on ImageNet.
Beneficial properties like larger receptive fields are transferred through KD.
Abstract
This paper reveals a new appeal of the recently emerged large-kernel Convolutional Neural Networks (ConvNets): as the teacher in Knowledge Distillation (KD) for small-kernel ConvNets. While Transformers have led state-of-the-art (SOTA) performance in various fields with ever-larger models and labeled data, small-kernel ConvNets are considered more suitable for resource-limited applications due to the efficient convolution operation and compact weight sharing. KD is widely used to boost the performance of small-kernel ConvNets. However, previous research shows that it is not quite effective to distill knowledge (e.g., global information) from Transformers to small-kernel ConvNets, presumably due to their disparate architectures. We hereby carry out a first-of-its-kind study unveiling that modern large-kernel ConvNets, a compelling competitor to Vision Transformers, are remarkably more…
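To make the teacher/student setup concrete, here is a minimal sketch of a logit-level distillation loss in the standard temperature-scaled form. It is an illustration only, not the paper's training recipe: the function names `softmax` and `kd_loss`, the temperature of 4.0, and the mixing weight `alpha` are assumptions chosen for this example. The teacher logits would come from a large-kernel ConvNet and the student logits from a small-kernel ConvNet.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Hypothetical helper: weighted sum of hard-label cross-entropy and
    soft-target KL divergence (teacher || student).

    temperature and alpha are illustrative defaults, not values from the paper.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # Soft-target term: KL divergence between softened distributions.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_t, p_s) if t > 0)

    # The T^2 factor keeps the soft-term gradients on a scale comparable
    # to the hard-label term as the temperature changes.
    return (1 - alpha) * ce + alpha * temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [5.0, 1.0, -2.0]  # e.g. logits from a large-kernel teacher
    student = [2.0, 1.5, 0.0]   # e.g. logits from a small-kernel student
    print(kd_loss(student, teacher, label=0))
```

With `alpha=1.0` the loss reduces to the soft-target term alone, which is zero when the student reproduces the teacher's logits exactly; with `alpha=0.0` it reduces to ordinary cross-entropy.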
Taxonomy
Topics: Advanced Neural Network Applications · COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning
Methods: ConvNeXt · Knowledge Distillation · Convolution
