Latency Adjustable Transformer Encoder for Language Understanding
Sajjad Kachuee, Mohammad Sharifkhani

TL;DR
This paper introduces a Transformer encoder that learns, during fine-tuning, to remove less important word-vectors in each layer. Once fine-tuned, its inference latency can be traded against accuracy without further training.
Contribution
It proposes the Attention Context Contribution (ACC) metric for identifying less important word-vectors, together with an offline-tuning property that lets the latency of a fine-tuned Transformer be adjusted without further training and without significant performance loss.
Findings
Up to 2.9x improvement in Time-to-First-Token for Llama3
Effective removal of less important word-vectors in higher layers
Minimal impact on global context and task performance
Abstract
Adjusting the latency, power, and accuracy of natural language understanding models is a desirable objective of an efficient architecture. This paper proposes an efficient Transformer architecture that adaptively adjusts the inference computational cost to a desired inference latency speedup. In the fine-tuning phase, the proposed method detects less important hidden sequence elements (word-vectors) and eliminates them in each encoder layer using a proposed Attention Context Contribution (ACC) metric. After the fine-tuning phase, thanks to the novel offline-tuning property, the inference latency of the model can be adjusted over a wide range of inference speedup selections without any further training. Extensive experiments reveal that most word-vectors in higher Transformer layers contribute less to subsequent layers, allowing their removal to improve inference latency. Experimental results on…
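The per-layer elimination step described in the abstract can be illustrated with a small sketch. The excerpt above does not define ACC, so the score used here (the average attention each position receives across heads and queries) is only a stand-in assumption, not the paper's metric. The names `received_attention`, `prune_word_vectors`, `keep_ratio`, and `protect_first` are hypothetical, and always keeping the first position (for example a classification token) is likewise an assumption.

```python
import math
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def received_attention(attn: Sequence[Matrix]) -> List[float]:
    """Average attention each key position receives over all heads and queries.

    attn[h][q][k] is the softmax weight query q places on key k in head h.
    Stand-in importance score (assumption), NOT the paper's ACC definition.
    """
    num_heads = len(attn)
    seq_len = len(attn[0])
    scores = [0.0] * seq_len
    for head in attn:
        for query_row in head:
            for k, weight in enumerate(query_row):
                scores[k] += weight
    return [s / (num_heads * seq_len) for s in scores]


def prune_word_vectors(
    hidden: Matrix,
    attn: Sequence[Matrix],
    keep_ratio: float,
    protect_first: bool = True,
) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep the highest-scoring word-vectors of one layer, in original order.

    keep_ratio is the fraction of positions passed on to the next layer; it
    can be changed at inference time without touching any weights.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    seq_len = len(hidden)
    num_keep = max(1, math.ceil(keep_ratio * seq_len))
    scores = received_attention(attn)
    if protect_first:
        scores[0] = math.inf  # assumption: never drop the first position
    ranked = sorted(range(seq_len), key=lambda i: -scores[i])
    kept = sorted(ranked[:num_keep])  # restore sequence order
    return [hidden[i] for i in kept], kept


# Toy layer: 4 word-vectors, one head, every query attends identically.
hidden = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
attn = [[[0.10, 0.60, 0.05, 0.25]] * 4]

for ratio in (1.0, 0.75, 0.5):
    _, kept = prune_word_vectors(hidden, attn, ratio)
    print(ratio, kept)
# 1.0 [0, 1, 2, 3]
# 0.75 [0, 1, 3]
# 0.5 [0, 1]
```

Lowering `keep_ratio` shortens the sequence handed to the next layer, which is where a latency saving would come from. How the paper chooses the amount removed in each layer is not stated in this excerpt.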
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Transformer · Attention Dropout · Dropout
