Light-Weight Vision Transformer with Parallel Local and Global Self-Attention

Nikolas Ebert; Laurenz Reichardt; Didier Stricker; Oliver Wasenmüller

arXiv:2307.09120·cs.CV·July 19, 2023

Open Access

TL;DR

This paper introduces a compact, efficient vision transformer architecture optimized for resource-constrained autonomous driving applications, achieving high accuracy with significantly fewer parameters.

Contribution

Redesigns the state-of-the-art PLG-ViT into a lightweight, lower-complexity model suitable for real-time autonomous driving tasks, and proposes two optimized variants.

Findings

01 Reduced model size by a factor of 5 with a moderate performance drop

02 Achieved 79.5% top-1 accuracy on ImageNet-1K with only 5 million parameters

03 Demonstrated strong performance on COCO and autonomous driving tasks

Abstract

While transformer architectures have dominated computer vision in recent years, these models cannot easily be deployed on hardware with limited resources for autonomous driving tasks that require real-time performance. Their computational complexity and memory requirements limit their use, especially for applications with high-resolution inputs. In our work, we redesign the powerful state-of-the-art Vision Transformer PLG-ViT to a much more compact and efficient architecture that is suitable for such tasks. We identify computationally expensive blocks in the original PLG-ViT architecture and propose several redesigns aimed at reducing the number of parameters and floating-point operations. As a result of our redesign, we are able to reduce PLG-ViT in size by a factor of 5, with a moderate drop in performance. We propose two variants, optimized for the best trade-off between parameter…
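The abstract's point about high-resolution inputs can be made concrete with a rough cost estimate. The sketch below is not taken from the paper and does not model PLG-ViT's blocks; it uses the textbook cost of a single global self-attention layer, and the patch size of 16 and channel width of 64 are illustrative assumptions. It shows that the parameter count is independent of resolution, while the attention term grows quadratically with the number of tokens.

```python
from typing import Tuple


def tokens_for_image(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for an image (hypothetical helper)."""
    return (height // patch) * (width // patch)


def global_attention_cost(num_tokens: int, dim: int) -> Tuple[int, int, int]:
    """Return (params, projection MACs, attention MACs) for one standard
    global self-attention layer. Generic estimate, not PLG-ViT specific."""
    # Q, K, V and output projections: four dim x dim weight matrices plus biases.
    params = 4 * dim * dim + 4 * dim
    # Each of the four projections costs num_tokens * dim * dim multiply-accumulates.
    proj_macs = 4 * num_tokens * dim * dim
    # Score matrix (Q K^T) and weighted sum (A V) each cost num_tokens^2 * dim.
    attn_macs = 2 * num_tokens * num_tokens * dim
    return params, proj_macs, attn_macs


if __name__ == "__main__":
    dim, patch = 64, 16  # assumed values, chosen only for illustration
    for side in (224, 448, 896):
        n = tokens_for_image(side, side, patch)
        params, proj, attn = global_attention_cost(n, dim)
        print(f"{side}x{side}: tokens={n}, params={params}, "
              f"projection MACs={proj}, attention MACs={attn}")
```

Doubling the image side length quadruples the token count, so the projection cost grows 4x and the attention cost grows 16x, while the parameter count stays fixed. This is the general scaling behavior that motivates restricting or restructuring global attention in compact models.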

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Neural Network Applications · Image Processing Techniques and Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization · Label Smoothing