Latency Adjustable Transformer Encoder for Language Understanding

Sajjad Kachuee; Mohammad Sharifkhani

arXiv:2201.03327·cs.CL·September 20, 2024

TL;DR

This paper introduces a Transformer encoder whose inference latency can be adjusted after fine-tuning. Less important word-vectors are detected and removed in each encoder layer during fine-tuning, and the speed-accuracy trade-off can then be changed without further training.

Contribution

The paper proposes a novel Attention Context Contribution (ACC) metric for identifying less important word-vectors, together with an offline-tuning method that adjusts the latency of a fine-tuned Transformer model without significant performance loss.

Findings

01

Up to 2.9x improvement in Time-to-First-Token for Llama3

02

Effective removal of less important word-vectors in higher layers

03

Minimal impact on global context and task performance

Abstract

Adjusting the latency, power, and accuracy of natural language understanding models is a desirable objective of an efficient architecture. This paper proposes an efficient Transformer architecture that adaptively adjusts the inference computational cost to match a desired inference latency speedup. In the fine-tuning phase, the proposed method detects less important hidden sequence elements (word-vectors) and eliminates them in each encoder layer using a proposed Attention Context Contribution (ACC) metric. After the fine-tuning phase, thanks to the novel offline-tuning property, the inference latency of the model can be adjusted over a wide range of inference speedup selections without any further training. Extensive experiments reveal that most word-vectors in higher Transformer layers contribute less to subsequent layers, allowing their removal to improve inference latency. Experimental results on…
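The word-vector elimination described in the abstract can be illustrated with a small sketch. The abstract does not give the definition of the ACC metric, so the score below is a stand-in and an assumption: each word-vector is scored by the total attention it receives, summed over heads and query positions. The names `token_scores`, `prune_word_vectors`, and `keep_ratio` are hypothetical and do not come from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_scores(attn):
    """Score each key position by the attention it receives.

    attn[h][i][j] is the attention probability from query i to key j in head h.
    This column-sum score is an ASSUMED stand-in for the paper's ACC metric.
    """
    n = len(attn[0])
    scores = [0.0] * n
    for head in attn:
        for row in head:
            for j, p in enumerate(row):
                scores[j] += p
    total = sum(scores)
    return [s / total for s in scores]  # normalized to sum to 1


def prune_word_vectors(hidden, attn, keep_ratio):
    """Keep the highest-scoring word-vectors, preserving their original order.

    keep_ratio is a hypothetical knob: lowering it at inference time removes
    more word-vectors without touching the model weights.
    """
    scores = token_scores(attn)
    k = max(1, math.ceil(keep_ratio * len(hidden)))
    top = sorted(range(len(hidden)), key=lambda j: scores[j], reverse=True)[:k]
    kept = sorted(top)
    return [hidden[j] for j in kept], kept


# Toy layer: 4 word-vectors, 1 head, every query attends mostly to tokens 0 and 2.
logits = [2.0, 0.0, 1.0, 0.0]
attn = [[softmax(logits) for _ in range(4)]]
hidden = [[0.0], [1.0], [2.0], [3.0]]

pruned, kept = prune_word_vectors(hidden, attn, keep_ratio=0.5)
print(kept)    # [0, 2]
print(pruned)  # [[0.0], [2.0]]
```

In this sketch the pruned sequence would be passed to the next encoder layer, so each later layer processes fewer word-vectors. Changing `keep_ratio` per layer is one way to picture how a single fine-tuned model could serve several latency targets, though the paper's actual selection rule may differ.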

Taxonomy

Topics

Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining

Methods

Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Transformer · Attention Dropout · Dropout