Bitformer: An efficient Transformer with bitwise operation-based attention for Big Data Analytics at low-cost low-precision devices
Gaoxiang Duan, Junkai Zhang, Xiaoying Zheng, Yongxin Zhu, Victor Chang

TL;DR
Bitformer introduces a novel Transformer variant utilizing bitwise operations in its attention mechanism, significantly reducing computational complexity and enabling efficient deployment on low-resource edge devices without sacrificing performance.
Contribution
The paper presents a new attention mechanism based on bitwise operations that replaces floating-point matrix multiplication, reducing complexity and making Transformers suitable for low-power, low-precision devices.
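To make the idea of swapping floating-point multiplication for bitwise operations concrete, here is a minimal sketch of one well-known technique: sign binarization followed by XNOR and popcount. This is an illustrative assumption, not the paper's exact attention formulation, which this summary does not spell out. The helper names `pack_signs`, `bitwise_score`, and `sign_dot` are hypothetical.

```python
def pack_signs(vec):
    """Pack the sign bits of a vector into one int (bit i is 1 if vec[i] >= 0)."""
    bits = 0
    for i, x in enumerate(vec):
        if x >= 0:
            bits |= 1 << i
    return bits


def bitwise_score(q_bits, k_bits, d):
    """Similarity of two packed sign vectors of length d using XOR + popcount.

    Equals the dot product of the corresponding +1/-1 vectors:
    matches - mismatches = d - 2 * hamming_distance.
    """
    hamming = bin(q_bits ^ k_bits).count("1")
    return d - 2 * hamming


def sign_dot(q, k):
    """Reference implementation: explicit dot product of the +1/-1 sign vectors."""
    def sign(x):
        return 1 if x >= 0 else -1
    return sum(sign(a) * sign(b) for a, b in zip(q, k))


# Illustrative query/key vectors (made-up values, not from the paper).
q = [0.3, -1.2, 0.7, 0.0]
k = [0.5, 0.4, -0.9, 2.0]
score = bitwise_score(pack_signs(q), pack_signs(k), len(q))
```

After packing, each query-key score needs one XOR and one bit count instead of d floating-point multiply-adds, which is the kind of operation low-precision hardware handles cheaply. Binarizing to signs discards magnitude information, so a real model would need some compensating design that this sketch does not attempt.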
Findings
Maintains ability to model long-range dependencies.
Reduces attention computation complexity from O(n^2 d) to O(n^2 T).
Enables efficient edge computing deployment.
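One rough way to read the complexity claim in the findings above is to compare operation counts. This assumes that n is the sequence length, d is the feature dimension, and T is the per-pair cost of the bitwise attention. The symbol T is not defined in this summary, so that last reading is an assumption.

```latex
\underbrace{n^{2}\, d}_{\text{float multiply-adds for } QK^{\top}}
\;\longrightarrow\;
\underbrace{n^{2}\, T}_{\text{bitwise operations}},
\qquad
\text{savings factor} \approx \frac{d}{T}.
```

Both bounds stay quadratic in n, so under this reading the gain comes only from T being smaller than d and from each bitwise operation costing less than a floating-point multiply-add.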
Abstract
In the current landscape of large models, the Transformer stands as a cornerstone, playing a pivotal role in shaping the trajectory of modern models. However, its application encounters challenges attributed to the substantial computational intricacies intrinsic to its attention mechanism. Moreover, its reliance on high-precision floating-point operations presents specific hurdles, particularly evident in computation-intensive scenarios such as edge computing environments. These environments, characterized by resource-constrained devices and a preference for lower precision, necessitate innovative solutions. To tackle the exacting data processing demands posed by edge devices, we introduce the Bitformer model, an inventive extension of the Transformer paradigm. Central to this innovation is a novel attention mechanism that adeptly replaces conventional floating-point matrix…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques · Quantum Computing Algorithms and Architecture
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Dropout · Softmax · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing
