Training Transformers in Cosine Coefficient Space
Mohamed Amine Bergach

TL;DR
This paper introduces a method for training transformer linear layers through a limited set of DCT coefficients, reducing the trainable parameter count while keeping performance comparable to a dense model.
Contribution
The authors propose a novel parameterization of transformer weights via DCT coefficients, demonstrating competitive performance with fewer trainable parameters and improved memory efficiency.
Findings
Transformer performance with DCT-coefficient layers is close to the dense baseline at half the trainable parameters.
Rank flexibility: a sparse set of DCT coefficients can represent high-rank weight matrices, which a low-rank factorization with the same parameter count cannot.
The DCT basis offers computational advantages due to its separable fast transform and on-chip memory residency.
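The rank-flexibility point can be illustrated with a small NumPy sketch. It is not the authors' code: the sizes `n` and `r`, the choice of placing the nonzero coefficients on the diagonal, and the helper name `dct_matrix` are all assumptions made for illustration.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n), so that C @ C.T is the identity."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)     # rescale the DC row for orthonormality
    return c

n, r = 16, 2                    # illustrative sizes, not taken from the paper
C = dct_matrix(n)

# Sparse-coefficient weight: only n of the n*n coefficients are nonzero
# (hypothetical support: the diagonal of the coefficient grid).
coeffs = np.zeros((n, n))
coeffs[np.arange(n), np.arange(n)] = np.arange(1, n + 1)
W_sparse = C.T @ coeffs @ C     # inverse 2D DCT of the coefficient grid
rank_sparse = np.linalg.matrix_rank(W_sparse)

# Low-rank weight: 2*n*r parameters, but rank at most r by construction.
rng = np.random.default_rng(0)
A = rng.standard_normal((n, r))
B = rng.standard_normal((r, n))
rank_lowrank = np.linalg.matrix_rank(A @ B)

print(rank_sparse, rank_lowrank)
```

Here 16 nonzero coefficients yield a full-rank 16 x 16 matrix, while the factorized matrix uses 64 parameters and still has rank 2. The rank of a factorization is capped by its inner dimension; the rank of a sparse-coefficient matrix depends on where the coefficients sit, not only on how many there are.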
Abstract
Linear layers hold most of a transformer's parameters. We replace each linear layer with one that stores only a subset of the two-dimensional DCT coefficients of each weight matrix and reconstructs the full matrix through an inverse DCT at every forward pass; the stored coefficients are the trainable parameters. A 4-layer, 128-dim transformer trained from scratch on character-level Shakespeare reaches a validation loss close to that of a standard dense baseline at half the trainable parameter count, with the gap falling within the terminal-epoch variation of the dense run. A rank-48 LoRA factorization at the same trainable parameter count reaches a worse validation loss. The structural advantage of sparse-coefficient over low-rank parameterizations at a matched parameter count is qualitative. We identify rank flexibility as the mechanism. A random orthonormal basis matches the DCT…
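A minimal NumPy sketch of the layer described in the abstract follows. It is an illustration under stated assumptions, not the authors' implementation: the class name `DCTLinear`, the random choice of which coefficient positions are trainable, the initialization scale, and the 128 x 128 shape are all hypothetical. The abstract specifies only that a subset of 2D DCT coefficients is stored and that the weight is rebuilt by an inverse DCT on every forward pass.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n), so that C @ C.T is the identity."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

class DCTLinear:
    """Linear layer whose trainable parameters are k DCT coefficients."""

    def __init__(self, d_in, d_out, k, seed=0):
        rng = np.random.default_rng(seed)
        self.shape = (d_out, d_in)
        # Assumption: the trainable positions are a fixed random subset.
        self.idx = rng.choice(d_out * d_in, size=k, replace=False)
        self.theta = 0.02 * rng.standard_normal(k)   # trainable coefficients
        self.C_out = dct_matrix(d_out)
        self.C_in = dct_matrix(d_in)

    def weight(self):
        # Scatter the coefficients into a grid, then apply the separable
        # inverse 2D DCT: W = C_out^T @ coeffs @ C_in.
        coeffs = np.zeros(self.shape)
        coeffs.flat[self.idx] = self.theta
        return self.C_out.T @ coeffs @ self.C_in

    def forward(self, x):
        # The dense weight is rebuilt on every call, as in the abstract.
        return x @ self.weight().T

    def grad_theta(self, grad_w):
        # Chain rule: dL/dcoeffs is the forward 2D DCT of dL/dW,
        # gathered at the trainable positions.
        return (self.C_out @ grad_w @ self.C_in.T).flat[self.idx]

# Illustrative usage: half of the 128 * 128 coefficients are trainable.
layer = DCTLinear(d_in=128, d_out=128, k=128 * 128 // 2)
x = np.ones((4, 128))
y = layer.forward(x)
print(layer.theta.size, y.shape)
```

Because the DCT is orthonormal and separable, the gradient with respect to the stored coefficients is a forward 2D DCT of the dense weight gradient, so no dense weight needs to be kept as a parameter. An automatic-differentiation framework would derive `grad_theta` on its own; it is written out here only to make the chain rule explicit.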
