TL;DR
This paper critiques the reliance on MACs (Multiply Accumulate operations) as a measure of vision backbone efficiency, introduces LowFormer, a backbone family with a lightweight attention mechanism called Lowtention, and demonstrates its advantages in speed and accuracy across hardware and tasks.
Contribution
The paper presents LowFormer, a novel vision backbone family built around Lowtention, a lightweight alternative to Multi-Head Self-Attention, and provides insights into hardware-efficient design that go beyond MAC count as a metric.
Findings
LowFormer achieves faster inference on edge and desktop GPUs.
Lowtention outperforms traditional multi-head self-attention in efficiency.
LowFormer maintains high accuracy across various vision tasks.
Abstract
Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet.…
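As a rough illustration of the metric the abstract questions, the sketch below counts MACs for a few convolution variants. It is not taken from the paper: the helper name `conv2d_macs` and the layer shapes (a 56x56 feature map with 64 channels) are assumptions chosen for illustration, and the formula is the standard MAC count for a 2D convolution.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """Return the MAC count of a 2D convolution (bias ignored).

    Each of the h_out * w_out * c_out output values needs
    (c_in / groups) * kernel * kernel multiply-accumulates.
    """
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("channels must be divisible by groups")
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel


# Hypothetical layer shapes, chosen only for illustration.
H = W = 56
C = 64

standard = conv2d_macs(H, W, C, C, kernel=3)             # dense 3x3 conv
depthwise = conv2d_macs(H, W, C, C, kernel=3, groups=C)  # one filter per channel
pointwise = conv2d_macs(H, W, C, C, kernel=1)            # 1x1 conv
separable = depthwise + pointwise                         # depthwise-separable pair

print(f"standard 3x3:  {standard:,} MACs")
print(f"depthwise 3x3: {depthwise:,} MACs")
print(f"separable:     {separable:,} MACs")
print(f"standard / separable: {standard / separable:.2f}x")
```

The separable pair needs far fewer MACs than the dense convolution, but a MAC count says nothing about memory access or how well an operation maps to a given device. Whether execution time shrinks by a similar factor has to be measured on the target hardware, which is the gap the abstract points to.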
