Bitformer: An efficient Transformer with bitwise operation-based attention for Big Data Analytics at low-cost low-precision devices

Gaoxiang Duan; Junkai Zhang; Xiaoying Zheng; Yongxin Zhu; Victor Chang

arXiv:2311.13502 · cs.LG · September 3, 2024 · 2 cites

Open Access

TL;DR

Bitformer introduces a novel Transformer variant utilizing bitwise operations in its attention mechanism, significantly reducing computational complexity and enabling efficient deployment on low-resource edge devices without sacrificing performance.

Contribution

The paper presents a new attention mechanism based on bitwise operations that replaces floating-point matrix multiplication, reducing complexity and making Transformers suitable for low-power, low-precision devices.
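
As a rough illustration of how bitwise operations can stand in for a floating-point dot product, the sketch below binarizes queries and keys by sign and scores them with XOR plus a bit count. This is an assumption made for illustration, not the paper's actual mechanism, which this summary does not specify. The names `pack_signs`, `bitwise_scores`, and `attention` are hypothetical, and the sketch does not attempt to reproduce the O(n^2T) cost stated in the findings.

```python
import numpy as np

def pack_signs(x):
    """Binarize each row by sign (>= 0 -> 1, < 0 -> 0) and pack it into one Python int."""
    bits = x >= 0
    return [int("".join("1" if b else "0" for b in row), 2) for row in bits]

def bitwise_scores(q, k):
    """Similarity of sign-binarized rows using XOR and a bit count.

    For vectors with entries in {-1, +1}, dot = d - 2 * hamming_distance,
    so the integer scores equal the dot product of the sign vectors.
    """
    d = q.shape[1]
    q_bits, k_bits = pack_signs(q), pack_signs(k)
    scores = np.empty((len(q_bits), len(k_bits)), dtype=np.int64)
    for i, qb in enumerate(q_bits):
        for j, kb in enumerate(k_bits):
            hamming = bin(qb ^ kb).count("1")  # number of differing sign bits
            scores[i, j] = d - 2 * hamming
    return scores

def attention(q, k, v):
    """Softmax attention over the bitwise scores (illustrative only).

    The softmax and the weighted sum over v still use floating point here.
    """
    d = q.shape[1]
    s = bitwise_scores(q, k) / np.sqrt(d)
    s = s - s.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    w = np.exp(s)
    w /= w.sum(axis=1, keepdims=True)
    return w @ v
```

In this sketch only the query-key scoring step is bitwise. On real hardware the packed bits would typically live in machine words and use a native popcount instruction rather than Python integers.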

Findings

01 Maintains the ability to model long-range dependencies.

02 Reduces attention computation complexity from O(n^2 d) to O(n^2 T).

03 Enables efficient edge computing deployment.

Abstract

In the current landscape of large models, the Transformer stands as a cornerstone, playing a pivotal role in shaping the trajectory of modern models. However, its application encounters challenges attributed to the substantial computational intricacies intrinsic to its attention mechanism. Moreover, its reliance on high-precision floating-point operations presents specific hurdles, particularly evident in computation-intensive scenarios such as edge computing environments. These environments, characterized by resource-constrained devices and a preference for lower precision, necessitate innovative solutions. To tackle the exacting data processing demands posed by edge devices, we introduce the Bitformer model, an inventive extension of the Transformer paradigm. Central to this innovation is a novel attention mechanism that adeptly replaces conventional floating-point matrix…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques · Quantum Computing Algorithms and Architecture

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Dropout · Softmax · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing