BEExformer: A Fast Inferencing Binarized Transformer with Early Exits

Wazib Ansar, Saptarsi Goswami, and Amlan Chakrabarti

arXiv:2412.05225 · cs.CL · May 13, 2026

TL;DR

BEExformer is a binarized transformer that combines early exits with selective learning. It reports a 21.30x reduction in model size, a 52.27% reduction in FLOPs, and a 3.22% accuracy improvement across NLP tasks, along with faster inference.

Contribution

The authors present BEExformer as the first transformer to integrate Binarization-Aware Training (BAT) with Early Exit (EE) and selective learning, targeting both efficiency and task performance.

Findings

01  Achieves a 21.30x reduction in model size.

02  Reduces FLOPs by 52.27%.

03  Improves accuracy by 3.22%.

Abstract

Large Language Models (LLMs) based on transformers achieve cutting-edge results on a variety of applications. However, their enormous size and processing requirements hinder deployment on constrained resources. To enhance efficiency, binarization and Early Exit (EE) have proved to be effective solutions. However, binarization may lead to performance loss as reduced precision affects gradient estimation and parameter updates. Besides, research on EE mechanisms is still in its early stages. To address these challenges, we introduce Binarized Early Exit Transformer (BEExformer), a first-of-its-kind selective learning-based transformer integrating Binarization-Aware Training (BAT) with EE for efficient and fast textual inference. Each transformer block has an integrated Selective-Learn Forget Network (SLFN) to enhance contextual retention while eliminating irrelevant information. The BAT…
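The two efficiency ideas named in the abstract, binarization and early exit, can be illustrated with a minimal sketch. It is not the paper's method: it assumes plain sign binarization with a clipped straight-through gradient estimator and a fixed entropy threshold as the exit rule, both of which are generic stand-ins. All function names and the `clip` and `threshold` parameters are hypothetical.

```python
import math

def binarize(weights):
    """Forward pass: map each real-valued weight to +1 or -1 (sign)."""
    return [1.0 if w >= 0 else -1.0 for w in weights]

def ste_grad(weights, upstream_grad, clip=1.0):
    """Backward pass (assumed straight-through estimator).

    sign() has zero gradient almost everywhere, so the upstream gradient is
    passed through unchanged where |w| <= clip and zeroed elsewhere.
    """
    return [g if abs(w) <= clip else 0.0 for w, g in zip(weights, upstream_grad)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def early_exit_predict(per_block_logits, threshold=0.3):
    """Return (block_index, predicted_class).

    Exits at the first block whose prediction entropy falls below the
    (assumed, fixed) threshold; otherwise falls back to the last block.
    """
    last = len(per_block_logits) - 1
    for i, logits in enumerate(per_block_logits):
        probs = softmax(logits)
        if entropy(probs) < threshold or i == last:
            return i, probs.index(max(probs))

# Usage: the first block is uncertain, the second is confident enough to exit.
print(binarize([0.3, -1.2, 0.0]))
print(ste_grad([0.3, -1.2], [0.5, 0.5]))
print(early_exit_predict([[0.1, 0.0], [5.0, 0.0], [9.0, 0.0]]))
```

In this sketch, the compute saving comes from skipping every block after the exit point, while the size saving comes from storing one bit per binarized weight instead of a full-precision value.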
