Transformer Scalability Crisis: The First Comprehensive Empirical Analysis of Performance Walls in Modern Language Models
Mahdi Naser Moghadasi, Faezeh Ghaderi

TL;DR
This paper systematically evaluates 118 transformer models, revealing fundamental performance limitations at longer sequence lengths and challenging assumptions about their scalability in real-world applications.
Contribution
It provides the first comprehensive empirical analysis of transformer performance walls, uncovering critical scalability issues and establishing new benchmarking methodologies.
Findings
88.1% of models successfully process sequences up to 512 tokens
Only 44.9% succeed at 1024 tokens, and none (0%) at 2048 tokens
Compressed models outperform large models in parameter efficiency
Abstract
Despite the remarkable success of transformer architectures in natural language processing, their scalability limitations have not been characterized through systematic empirical analysis. This paper presents the first comprehensive large-scale evaluation of 118 transformer models across seven distinct architectural categories, revealing fundamental performance walls that manifest as hard deployment constraints. Our systematic benchmarking methodology uncovers a critical scalability crisis: while 88.1% of models successfully process sequences up to 512 tokens, this drops dramatically to 44.9% at 1024 tokens, with complete failure (0%) at 2048 tokens. Through rigorous analysis of loading times, memory consumption, and computational efficiency across sequence lengths from 128 to 2048 tokens, we demonstrate that compressed models achieve superior parameter efficiency (649.2 tokens/sec/M…
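The success rates quoted above are, in effect, the share of evaluated models that complete a run at each sequence length. The following is a minimal sketch of how such a per-length success rate could be tabulated. It is not the authors' benchmarking code: the record layout, the function name `success_rate_by_length`, and the toy entries in `RESULTS` are all assumptions made for illustration, and the numbers are not the paper's data.

```python
from collections import defaultdict

# Hypothetical benchmark records: (model_name, sequence_length, succeeded).
# Toy values for illustration only; these are NOT the paper's measurements.
RESULTS = [
    ("model-a", 512, True),
    ("model-a", 1024, True),
    ("model-a", 2048, False),
    ("model-b", 512, True),
    ("model-b", 1024, False),
    ("model-b", 2048, False),
]


def success_rate_by_length(results):
    """Return {sequence_length: percentage of models that succeeded}."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for _model, seq_len, ok in results:
        attempts[seq_len] += 1
        if ok:
            successes[seq_len] += 1
    # Sort by sequence length so the drop-off reads top to bottom.
    return {n: 100.0 * successes[n] / attempts[n] for n in sorted(attempts)}


rates = success_rate_by_length(RESULTS)
for seq_len, rate in rates.items():
    print(f"{seq_len:>5} tokens: {rate:.1f}% of models succeeded")
```

With real data, each record would come from attempting a forward pass at the given length and recording whether it completed; what counts as a failure (out-of-memory, exceeding a position limit, a timeout) is a design choice that the truncated abstract does not specify.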
