BEExformer: A Fast Inferencing Binarized Transformer with Early Exits
Wazib Ansar, Saptarsi Goswami, and Amlan Chakrabarti

TL;DR
BEExformer introduces a binarized transformer with early exit mechanisms and selective learning, significantly reducing model size and FLOPs while improving inference speed and accuracy across NLP tasks.
Contribution
It is the first transformer integrating binarization-aware training with early exit and selective learning to enhance efficiency and performance.
Findings
Achieves 21.30x model size reduction.
Reduces FLOPs by 52.27%.
Improves accuracy by 3.22%.
Abstract
Large Language Models (LLMs) based on transformers achieve cutting-edge results on a variety of applications. However, their enormous size and processing requirements hinder deployment on constrained resources. To enhance efficiency, binarization and Early Exit (EE) have proved to be effective solutions. However, binarization may lead to performance loss as reduced precision affects gradient estimation and parameter updates. Besides, research on EE mechanisms is still in its early stages. To address these challenges, we introduce Binarized Early Exit Transformer (BEExformer), a first-of-its-kind selective learning-based transformer integrating Binarization-Aware Training (BAT) with EE for efficient and fast textual inference. Each transformer block has an integrated Selective-Learn Forget Network (SLFN) to enhance contextual retention while eliminating irrelevant information. The BAT…
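The two efficiency ideas named in the abstract, weight binarization and early exit, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract is truncated before it describes the BAT procedure, so the sign function with a mean-absolute scale, the straight-through gradient rule, and the fixed entropy threshold used for exiting are all generic stand-ins chosen for illustration. The names `binarize`, `ste_grad`, `entropy`, and `forward_with_early_exit` are hypothetical.

```python
import numpy as np

def binarize(w):
    """Map real-valued weights to {-alpha, +alpha} (assumed scheme, not the paper's)."""
    alpha = np.mean(np.abs(w))            # per-tensor scale preserves average magnitude
    return alpha * np.where(w >= 0, 1.0, -1.0)

def ste_grad(w, upstream, clip=1.0):
    """Straight-through estimator: pass the gradient where |w| <= clip, zero elsewhere.

    The sign function has zero gradient almost everywhere, which is why
    binarized training needs some surrogate gradient like this one.
    """
    return upstream * (np.abs(w) <= clip)

def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    z = logits - np.max(logits)           # subtract the max for numerical stability
    p = np.exp(z) / np.sum(np.exp(z))
    return float(-np.sum(p * np.log(p + 1e-12)))

def forward_with_early_exit(x, blocks, heads, threshold):
    """Run blocks in order and stop at the first exit head whose entropy is low.

    `blocks` and `heads` are equal-length, non-empty lists of callables.
    Returns the logits and the number of blocks actually executed.
    """
    for depth, (block, head) in enumerate(zip(blocks, heads), start=1):
        x = block(x)
        logits = head(x)
        if entropy(logits) < threshold:   # confident enough: skip remaining blocks
            break
    return logits, depth
```

In this sketch, an input that an early head already classifies confidently skips the remaining blocks, which is where the inference-time FLOPs saving of an early-exit design comes from. The threshold trades accuracy against compute and would need tuning per task.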
