INCRT: An Incremental Transformer That Determines Its Own Architecture

Giansalvo Cirrincione

arXiv:2604.10703 · cs.LG · April 14, 2026

TL;DR

INCRT introduces an adaptive Transformer that incrementally adds or prunes attention heads during training based on task complexity, reducing redundancy and matching or surpassing BERT-base performance without pre-training.

Contribution

The paper presents INCRT, a novel Transformer architecture that self-determines its structure during training using a geometric criterion, eliminating the need for trial-and-error design.

Findings

01. INCRT's head count aligns with theoretical predictions within 12%.

02. Final architectures are more parameter-efficient, using 3 to 7 times fewer parameters than BERT-base.

03. INCRT achieves comparable or better performance on benchmark tasks without pre-training.

Abstract

Transformer architectures are designed by trial and error: the number of attention heads, the depth, and the head size are fixed before training begins, with no mathematical principle to guide the choice. The result is systematic structural redundancy -- between half and four-fifths of all heads in a trained model can be removed without measurable loss -- because the architecture allocates capacity without reference to the actual requirements of the task.

This paper introduces INCRT (Incremental Transformer), an architecture that determines its own structure during training. Starting from a single head, INCRT adds one attention head at a time whenever its current configuration is provably insufficient, and prunes heads that have become redundant. Each growth decision is driven by a single, online-computable geometric quantity derived from the task's directional structure, requiring no…
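The grow-and-prune schedule described in the abstract can be illustrated with a small control-loop sketch. This is not the paper's method: the abstract does not define the geometric quantity or the redundancy test, so the `residual` signal, the per-head `scores`, both thresholds, and the `HeadBudget` class itself are hypothetical stand-ins. The sketch only shows the shape of the loop: start from one head, add at most one head per step, and drop heads judged redundant.

```python
from dataclasses import dataclass, field


@dataclass
class HeadBudget:
    """Toy grow/prune controller (hypothetical; not the rule used by INCRT)."""

    grow_threshold: float = 0.1    # assumed: grow when the residual signal exceeds this
    prune_threshold: float = 0.01  # assumed: prune heads whose score falls below this
    head_scores: list = field(default_factory=lambda: [1.0])  # start from a single head

    def step(self, residual: float, scores: list) -> int:
        """Update the head set from one round of measurements; return the head count.

        `residual` stands in for the growth criterion and `scores` for a
        per-head usefulness measure. `scores` must be non-empty.
        """
        # Prune heads judged redundant, but never drop below one head.
        kept = [s for s in scores if s >= self.prune_threshold]
        if not kept:
            kept = [max(scores)]
        # Grow by at most one head per step when the criterion fires.
        if residual > self.grow_threshold:
            kept.append(1.0)
        self.head_scores = kept
        return len(kept)


budget = HeadBudget()
n_heads = budget.step(residual=0.5, scores=[1.0])  # criterion fires, one head is added
```

In a real training loop the measurements would be recomputed from the model between steps; here they are passed in by hand to keep the sketch self-contained.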
