Image Recognition with Online Lightweight Vision Transformer: A Survey

Zherui Zhang; Rongtao Xu; Jie Zhou; Changwei Wang; Xingtian Pei; Wenhao Xu; Jiguang Zhang; Li Guo; Longxiang Gao; Wenbo Xu; Shibiao Xu

arXiv:2505.03113 · cs.CV · September 29, 2025

TL;DR

This survey reviews online strategies for creating lightweight vision transformers for image recognition, analyzing their trade-offs and proposing future research directions to improve efficiency and applicability.

Contribution

It systematically evaluates lightweight vision transformer techniques on ImageNet-1K, highlighting their advantages, disadvantages, and potential for real-world deployment.

Findings

01. Efficient component design improves model speed with minimal accuracy loss.

02. Dynamic networks adapt complexity based on input, balancing performance and efficiency.

03. Knowledge distillation enhances lightweight models by transferring knowledge from larger networks.
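
To make the knowledge distillation finding concrete, here is a minimal sketch of the classic soft-label distillation loss (temperature-softened teacher probabilities combined with the usual hard-label cross-entropy). The survey summary does not say which distillation variants it covers, so this formulation, the function names, and the default `temperature` and `alpha` values are illustrative assumptions rather than the authors' method.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Soft-label distillation loss for one example (illustrative sketch).

    alpha weights the teacher-matching term; (1 - alpha) weights the
    ordinary cross-entropy against the ground-truth label.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    ce = -math.log(softmax(student_logits)[label])

    # Soft-label term: KL(teacher || student) on temperature-softened outputs.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher, p_student) if t > 0)

    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return (1 - alpha) * ce + alpha * temperature ** 2 * kl


# Hypothetical logits for a 3-class problem.
loss = distillation_loss([1.0, 2.0, 3.0], [0.5, 2.5, 3.5], label=2)
```

When the student already matches the teacher, the soft-label term vanishes and only the weighted cross-entropy remains, which is one quick sanity check for an implementation like this.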

Abstract

The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, but they lack the inductive biases and efficiency benefits that convolutions provide, and their significant computational and memory costs limit their real-world applicability. This paper surveys online strategies for generating lightweight vision transformers for image recognition, focusing on three key areas: Efficient Component Design, Dynamic Network, and Knowledge Distillation. We evaluate representative work on each topic on the ImageNet-1K benchmark, analyzing trade-offs among precision, parameters, throughput, and more to highlight their respective advantages, disadvantages, and flexibility. Finally, we propose future…
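
The computational cost the abstract refers to can be illustrated with a rough cost model for a single standard transformer encoder block. The sketch below counts multiply-accumulate operations (MACs) only, ignoring layer normalization, softmax, and biases. The function name `vit_block_macs` is hypothetical, and the token count and width (197 tokens, 768 channels, as in a ViT-Base/16 model at 224×224 input) are assumptions chosen for illustration, not figures from the survey.

```python
def vit_block_macs(num_tokens, dim, mlp_ratio=4):
    """Approximate MACs for one transformer encoder block (illustrative)."""
    # Q, K, V and output projections: four dim x dim linear maps per token.
    projections = 4 * num_tokens * dim * dim
    # QK^T and the attention-weighted sum over V: quadratic in token count.
    attention = 2 * num_tokens * num_tokens * dim
    # Two-layer MLP with hidden width mlp_ratio * dim.
    mlp = 2 * mlp_ratio * num_tokens * dim * dim
    return projections + attention + mlp


# Assumed ViT-Base/16-like setting: 196 patch tokens plus one class token.
full = vit_block_macs(num_tokens=197, dim=768)

# Keeping roughly half of the tokens, as a dynamic token-pruning method might.
pruned = vit_block_macs(num_tokens=99, dim=768)

print(f"full block:   {full / 1e6:.1f} MMACs")
print(f"pruned block: {pruned / 1e6:.1f} MMACs")
```

Under these assumptions the projection and MLP terms scale linearly with the number of tokens while the attention term scales quadratically, which is why reducing tokens per input is a common lever for dynamic networks.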

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Adam · Dropout · Knowledge Distillation · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding