Two Heads Are Better than One: Simulating Large Transformers with Small Ones
Hantao Yu, Josh Alman

TL;DR
This paper shows that large transformers processing long sequences can be efficiently simulated by multiple small transformers, which modern hardware is well optimized to run on short inputs. It proves worst-case bounds on the number of small transformers required and analyzes natural scenarios in which fewer suffice.
Contribution
It introduces a method for simulating large transformers with small ones, with a bound on the number of small transformers needed that is optimal in the worst case and a smaller bound for natural scenarios.
Findings
A large transformer with input length N can be simulated by O((N/M)^2) small transformers with input length M.
In natural scenarios, only O(N/M) small transformers are needed.
The O((N/M)^2) bound is proven to be optimal in the worst case.
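To give a feel for where the two counts come from, here is a rough counting sketch. It is not the paper's construction. It assumes, purely for illustration, that each call to a small model of context length m sees one block of queries and one block of keys, each of m // 2 tokens. The function name block_pairs and the window parameter (a causal sliding window, chosen here as one example of a restricted attention pattern) are hypothetical.

```python
from math import ceil


def block_pairs(n, m, window=None):
    """Count (query block, key block) pairs that must be processed together.

    Illustrative assumption (not from the paper): the sequence of length n
    is cut into blocks of m // 2 tokens, so one query block plus one key
    block fits in a small context of length m.

    window=None means full attention; otherwise query position i attends
    only to key positions in [i - window + 1, i].
    """
    half = m // 2
    if half < 1:
        raise ValueError("m must be at least 2")
    num_blocks = ceil(n / half)
    pairs = 0
    for q in range(num_blocks):
        q_lo = q * half
        q_hi = min(n, (q + 1) * half) - 1
        for k in range(num_blocks):
            k_lo = k * half
            k_hi = min(n, (k + 1) * half) - 1
            if window is None:
                # Full attention: every query block interacts with every key block.
                pairs += 1
            elif k_lo <= q_hi and k_hi >= q_lo - window + 1:
                # Sliding window: only key blocks overlapping the window count.
                pairs += 1
    return pairs


if __name__ == "__main__":
    print(block_pairs(1024, 128))             # full attention
    print(block_pairs(1024, 128, window=64))  # causal sliding window
```

With n = 1024 and m = 128, this gives 256 block pairs for full attention and 31 with a window of 64. Doubling n roughly quadruples the first count and doubles the second, which matches the quadratic versus linear growth in N/M stated above.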
Abstract
The quadratic complexity of self-attention prevents transformers from scaling effectively to long input sequences. On the other hand, modern GPUs and other specialized hardware accelerators are well-optimized for processing small input sequences in transformers during both training and inference. A natural question arises: can we take advantage of the efficiency of small transformers to deal with long input sequences? In this paper, we show that transformers with long input sequences (large transformers) can be efficiently simulated by transformers that can only take short input sequences (small transformers). Specifically, we prove that any transformer with input length N can be efficiently simulated by only O((N/M)^2) transformers with input length M, and that this cannot be improved in the worst case. However, we then prove that in various natural scenarios including…
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques
