NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches
Ethan Smith (Canva Research)

TL;DR
NOBLE adds nonlinear low-rank branches to transformer linear layers, reducing the number of training steps needed to reach baseline loss at a small parameter cost across several model types, including LLMs, BERT, VQGAN, and ViT.
Contribution
This work presents NOBLE, an architectural augmentation that adds nonlinear low-rank branches to transformer linear layers for pretraining from scratch. Unlike PEFT methods such as LoRA, the branch is a permanent part of the architecture rather than an adapter on top of frozen weights.
Findings
Reaches baseline eval loss with up to a 1.47x step speedup (up to 32% fewer training steps) at minimal overhead.
Improves training efficiency across multiple model types.
Certain augmentations interfere with NOBLE's benefits.
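The 1.47x step speedup and the 32% reduction in training steps quoted in the abstract are two views of the same measurement. A minimal sketch of the conversion, assuming "step speedup" means baseline steps divided by NOBLE steps to reach the same eval loss (the helper name `step_reduction` is ours, not the paper's):

```python
def step_reduction(speedup: float) -> float:
    """Fraction of training steps saved, given a step speedup factor.

    Assumes speedup = baseline_steps / noble_steps at equal eval loss.
    """
    return 1.0 - 1.0 / speedup


print(f"{step_reduction(1.47):.1%}")  # 32.0%
```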
Abstract
We introduce NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural augmentation that adds nonlinear low-rank branches to transformer linear layers. Unlike LoRA and other parameter-efficient fine-tuning (PEFT) methods, NOBLE is designed for pretraining from scratch. The branch is a permanent part of the architecture as opposed to an adapter for finetuning on top of frozen weights. The branch computes $\sigma(x W_{\text{down}}) W_{\text{up}}$, where $\sigma$ is a learnable nonlinearity. We evaluate several activation functions and find that CosNet, a two-layer cosine nonlinearity with learnable frequency and phase and a linear projection between the two layers in the bottleneck space, performs best. NOBLE achieves substantial improvements with minimal overhead: up to 1.47x step speedup to reach baseline eval loss (up to 32% fewer training steps), with as low as 4% additional parameters and 7%…
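The branch described above can be sketched in NumPy as follows. This is an illustrative reading of the abstract, not the author's implementation. Several details are assumptions: that the branch output is summed with the main linear output, that frequency and phase are per-channel vectors in the bottleneck, the initialization scheme, and all names (`cosnet`, `init_noble`, `noble_linear`, `extra_param_fraction`).

```python
import numpy as np


def cosnet(h, freq1, phase1, mix, freq2, phase2):
    """Two cosine layers with a linear projection between them (assumed form)."""
    h = np.cos(freq1 * h + phase1)  # first cosine: per-channel frequency and phase
    h = h @ mix                     # rank x rank projection inside the bottleneck
    return np.cos(freq2 * h + phase2)


def init_noble(d_in, d_out, rank, seed=0):
    """Hypothetical initialization; the abstract does not specify one."""
    rng = np.random.default_rng(seed)
    return {
        "W": rng.normal(0.0, d_in ** -0.5, (d_in, d_out)),
        "W_down": rng.normal(0.0, d_in ** -0.5, (d_in, rank)),
        "W_up": rng.normal(0.0, rank ** -0.5, (rank, d_out)),
        "cos": {
            "freq1": np.ones(rank), "phase1": np.zeros(rank),
            "mix": rng.normal(0.0, rank ** -0.5, (rank, rank)),
            "freq2": np.ones(rank), "phase2": np.zeros(rank),
        },
    }


def noble_linear(x, p):
    """Main linear map plus the branch sigma(x W_down) W_up (sum is assumed)."""
    main = x @ p["W"]
    branch = cosnet(x @ p["W_down"], **p["cos"]) @ p["W_up"]
    return main + branch


def extra_param_fraction(d_in, d_out, rank):
    """Branch parameters relative to the main weight, under the layout above."""
    extra = rank * (d_in + d_out) + rank * rank + 4 * rank
    return extra / (d_in * d_out)


params = init_noble(d_in=64, d_out=32, rank=8)
x = np.random.default_rng(1).normal(size=(2, 5, 64))
y = noble_linear(x, params)
print(y.shape)  # (2, 5, 32)
```

Because the branch passes through a rank-`rank` bottleneck, its parameter count grows linearly in the layer width while the main weight grows quadratically, which is why the overhead can stay at a few percent. The exact ranks and overheads used in the paper are not given in this excerpt.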
Taxonomy
Topics: Advanced Neural Network Applications · Image Enhancement Techniques · Generative Adversarial Networks and Image Synthesis
