Light-Weight Vision Transformer with Parallel Local and Global Self-Attention
Nikolas Ebert, Laurenz Reichardt, Didier Stricker, Oliver Wasenmüller

TL;DR
This paper introduces a compact, efficient vision transformer architecture optimized for resource-constrained autonomous driving applications, achieving high accuracy with significantly fewer parameters.
Contribution
Redesigns the state-of-the-art PLG-ViT into a lightweight model with reduced complexity, suitable for real-time autonomous driving tasks, with two optimized variants.
Findings
Reduced the size of PLG-ViT by a factor of 5 with a moderate drop in performance
Achieved 79.5% top-1 accuracy on ImageNet-1K with only 5 million parameters
Demonstrated strong performance on COCO and autonomous driving tasks
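To put parameter figures like these in context, the sketch below counts the weights of one generic transformer encoder block and shows how narrowing the feed-forward layer shrinks it. This is a minimal illustration and not the PLG-ViT or Light-Weight PLG-ViT block: the function name `block_params`, the width `dim=96`, and the MLP ratios are assumptions chosen for the example.

```python
def block_params(dim: int, mlp_ratio: float = 4.0) -> dict:
    """Count weights and biases of one generic transformer encoder block.

    Assumes standard multi-head self-attention (Q, K, V, and output
    projections), a two-layer MLP, and two LayerNorms. Hypothetical helper,
    not taken from the paper.
    """
    hidden = int(dim * mlp_ratio)
    # Four dim x dim linear layers with biases: Q, K, V, and output projection.
    attention = 4 * (dim * dim + dim)
    # Expansion layer (dim -> hidden) and projection layer (hidden -> dim).
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * (2 * dim)
    return {
        "attention": attention,
        "mlp": mlp,
        "norms": norms,
        "total": attention + mlp + norms,
    }


baseline = block_params(dim=96, mlp_ratio=4.0)
narrow = block_params(dim=96, mlp_ratio=2.0)

for name, counts in (("mlp_ratio=4", baseline), ("mlp_ratio=2", narrow)):
    print(name, counts)
```

With an MLP ratio of 4, the feed-forward part holds roughly twice as many parameters as the attention projections, which is why such layers are a common target when slimming a transformer. Which blocks the authors actually redesigned is described in the paper itself.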
Abstract
While transformer architectures have dominated computer vision in recent years, these models cannot easily be deployed on hardware with limited resources for autonomous driving tasks that require real-time performance. Their computational complexity and memory requirements limit their use, especially for applications with high-resolution inputs. In our work, we redesign the powerful state-of-the-art Vision Transformer PLG-ViT into a much more compact and efficient architecture that is suitable for such tasks. We identify computationally expensive blocks in the original PLG-ViT architecture and propose several redesigns aimed at reducing the number of parameters and floating-point operations. As a result of our redesign, we are able to reduce PLG-ViT in size by a factor of 5, with a moderate drop in performance. We propose two variants, optimized for the best trade-off between parameter…
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Advanced Neural Network Applications · Image Processing Techniques and Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization · Label Smoothing
