INCRT: An Incremental Transformer That Determines Its Own Architecture
Giansalvo Cirrincione

TL;DR
INCRT introduces an adaptive Transformer that incrementally adds or prunes attention heads during training based on task complexity, reducing redundancy and matching or surpassing BERT-base performance without pre-training.
Contribution
The paper presents INCRT, a novel Transformer architecture that self-determines its structure during training using a geometric criterion, eliminating the need for trial-and-error design.
Findings
INCRT's head count aligns with theoretical predictions within 12%.
Final architectures are more parameter-efficient, using 3-7 times fewer parameters than BERT-base.
INCRT achieves comparable or better performance on benchmark tasks without pre-training.
Abstract
Transformer architectures are designed by trial and error: the number of attention heads, the depth, and the head size are fixed before training begins, with no mathematical principle to guide the choice. The result is systematic structural redundancy -- between half and four-fifths of all heads in a trained model can be removed without measurable loss -- because the architecture allocates capacity without reference to the actual requirements of the task. This paper introduces INCRT (Incremental Transformer), an architecture that determines its own structure during training. Starting from a single head, INCRT adds one attention head at a time whenever its current configuration is provably insufficient, and prunes heads that have become redundant. Each growth decision is driven by a single, online-computable geometric quantity derived from the task's directional structure, requiring no…
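The grow-and-prune control flow described in the abstract can be illustrated with a toy loop. This is only a sketch under stated assumptions: the abstract does not give the geometric quantity here, so `score_fn` and `redundancy_fn` are hypothetical stand-ins for it, and the names `incremental_head_schedule`, `grow_tol`, and `prune_tol` are illustrative, not taken from the paper.

```python
from typing import Callable, List


def incremental_head_schedule(
    score_fn: Callable[[int], float],
    redundancy_fn: Callable[[int], float],
    grow_tol: float,
    prune_tol: float,
    max_steps: int = 50,
) -> List[int]:
    """Return the head count after each step of a toy grow/prune loop.

    score_fn(h) is a hypothetical "insufficiency" score for a model with
    h heads; redundancy_fn(h) is a hypothetical redundancy score. Neither
    is the paper's actual criterion.
    """
    heads = 1  # the abstract states that training starts from a single head
    history = [heads]
    for _ in range(max_steps):
        if score_fn(heads) > grow_tol:
            heads += 1  # current configuration judged insufficient: add one head
        elif heads > 1 and redundancy_fn(heads) > prune_tol:
            heads -= 1  # a head judged redundant: prune one head
        else:
            break  # neither condition fires: the architecture has settled
        history.append(heads)
    return history


# Toy usage: insufficiency shrinks as 1/h and no head is ever redundant.
schedule = incremental_head_schedule(
    score_fn=lambda h: 1.0 / h,
    redundancy_fn=lambda h: 0.0,
    grow_tol=0.22,
    prune_tol=0.01,
)
print(schedule)  # [1, 2, 3, 4, 5]
```

With these toy scores the loop stops at five heads, the first count at which the assumed insufficiency score falls below `grow_tol`. In the paper, the stopping point would instead be set by the geometric criterion it derives.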
