Transformer^-1: Input-Adaptive Computation for Resource-Constrained Deployment

Lumen AI; Tengzhou No. 1 Middle School; Shihao Ji; Zihui Song; Fucheng Zhong; Jisen Jia; Zhaobo Wu; Zheyi Cao; Xu Tianhao

arXiv:2501.16394 · cs.LG · January 29, 2025

TL;DR

This paper introduces Transformer^-1, an input-adaptive architecture that allocates computation according to input complexity. It reports substantially lower FLOPs and peak memory usage while keeping accuracy close to that of a standard Transformer, and it targets resource-constrained deployment.

Contribution

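The paper contributes an input-adaptive Transformer architecture with a two-layer control mechanism (a complexity predictor and a reinforcement learning policy network), a theoretical lower bound for dynamic computation, and two engineering techniques (layer folding and CUDA Graph pre-compilation) that make dynamic architectures practical in resource-limited environments.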

Findings

01 42.7% reduction in FLOPs on ImageNet-1K
02 34.1% decrease in peak memory usage
03 Accuracy maintained within ±0.3% of a standard Transformer
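
To make the reported percentages concrete, the short sketch below applies them to a baseline. The baseline values (10 GFLOPs, 8 GB peak memory) and the helper names are hypothetical and chosen only for illustration; they are not taken from the paper. The ±0.3% figure is assumed here to mean percentage points of top-1 accuracy, which the page does not state explicitly.

```python
def reduced(baseline: float, reduction_pct: float) -> float:
    """Return the value left after cutting `baseline` by `reduction_pct` percent."""
    return baseline * (1.0 - reduction_pct / 100.0)


def within_tolerance(baseline_acc: float, adaptive_acc: float, tol: float = 0.3) -> bool:
    """Check that two accuracies (in percent) differ by at most `tol` points.

    Assumption: the reported +/-0.3% is read as percentage points.
    """
    return abs(adaptive_acc - baseline_acc) <= tol


# Hypothetical baseline costs, not figures from the paper.
BASELINE_GFLOPS = 10.0
BASELINE_PEAK_MEM_GB = 8.0

# Reported reductions: 42.7% FLOPs, 34.1% peak memory.
adaptive_gflops = reduced(BASELINE_GFLOPS, 42.7)
adaptive_peak_mem_gb = reduced(BASELINE_PEAK_MEM_GB, 34.1)

print(f"FLOPs:  {BASELINE_GFLOPS:.2f} -> {adaptive_gflops:.2f} GFLOPs")
print(f"Memory: {BASELINE_PEAK_MEM_GB:.2f} -> {adaptive_peak_mem_gb:.2f} GB")
```

Under these assumed baselines, the adaptive model would need roughly 57% of the original compute and 66% of the original peak memory per input on average.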

Abstract

To address the resource waste caused by fixed computation paradigms in deep learning models under dynamic scenarios, this paper proposes the Transformer$^{-1}$ architecture, based on the principle of deep adaptivity. The architecture dynamically matches computational resources to input features by establishing a joint optimization model for complexity and computation. Our core contributions are: (1) a two-layer control mechanism, composed of a complexity predictor and a reinforcement learning policy network, that enables end-to-end optimization of computation paths; (2) a lower-bound theory for dynamic computation, proving that the system can theoretically reach optimal efficiency; and (3) a layer folding technique and a CUDA Graph pre-compilation scheme that overcome the engineering bottlenecks of dynamic architectures. In the ImageNet-1K benchmark…
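
The two-layer control idea in the abstract (estimate input complexity, then pick a computation path) can be illustrated with a toy sketch. This is not the paper's method: the complexity proxy (a squashed standard deviation), the fixed depth rule standing in for the reinforcement learning policy network, and all function names are assumptions made for illustration. The sketch only shows the general pattern of running fewer layers on simpler inputs.

```python
import math
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[List[float]], List[float]]


def complexity_score(x: Sequence[float]) -> float:
    """Hypothetical complexity proxy: standard deviation squashed into [0, 1)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var)
    return std / (1.0 + std)


def choose_depth(score: float, max_layers: int, min_layers: int = 1) -> int:
    """Hypothetical fixed rule (stand-in for a learned policy):
    higher complexity scores are mapped to more layers."""
    depth = min_layers + round(score * (max_layers - min_layers))
    return max(min_layers, min(max_layers, depth))


def adaptive_forward(x: Sequence[float], layers: List[Layer]) -> Tuple[List[float], int]:
    """Run only the first `depth` layers, where depth depends on the input."""
    depth = choose_depth(complexity_score(x), len(layers))
    h = list(x)
    for layer in layers[:depth]:
        h = layer(h)
    return h, depth


def make_layer(scale: float) -> Layer:
    """Toy residual layer: h + scale * tanh(h)."""
    return lambda h: [v + scale * math.tanh(v) for v in h]


layers = [make_layer(0.1) for _ in range(12)]
_, easy_depth = adaptive_forward([0.5, 0.5, 0.5, 0.5], layers)    # constant input
_, hard_depth = adaptive_forward([-9.0, 4.0, 12.0, -7.0], layers)  # high-variance input
print("layers used:", easy_depth, hard_depth)
```

In this toy setup the constant input exits after a single layer while the high-variance input runs most of the stack. A learned predictor and policy, as the abstract describes, would replace both hand-written rules.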

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Real-Time Systems Scheduling

Methods: Attention Is All You Need · Softmax · Adam · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer