Training Transformers in Cosine Coefficient Space

Mohamed Amine Bergach

arXiv:2604.04440·cs.PF·April 10, 2026


TL;DR

This paper introduces a method to train transformer linear layers using a limited set of DCT coefficients, reducing parameters while maintaining comparable performance to dense models.

Contribution

The author proposes a parameterization of transformer weight matrices by a sparse set of two-dimensional DCT coefficients, and reports near-dense performance with half the trainable parameters and improved memory efficiency.

Findings

01

With half the trainable parameters ($K = mn/2$), the DCT-parameterized transformer reaches validation loss $1.604$, against $1.580$ for the dense baseline.

02

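Rank flexibility: a sparse set of DCT coefficients can represent high-rank weight matrices, which a low-rank factorization at the same trainable parameter count cannot. The abstract identifies this as the mechanism behind the gap to LoRA ($1.801$ for rank-48 LoRA).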

03

DCT basis offers computational advantages due to its separable fast transform and on-chip memory residency.
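The rank-flexibility and separable-transform points can be illustrated with a small NumPy sketch. It is not the paper's code: the helper `dct_matrix`, the size `n = 16`, and the choice of placing the nonzero coefficients on the diagonal of the coefficient grid are assumptions made purely for illustration. The sketch reconstructs a weight matrix from only `n` coefficients via a separable inverse 2D DCT (one matrix product per axis) and compares its rank with that of a rank-1 factorization, which already costs `2n` parameters.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D of shape (n, n), so that D @ D.T == I."""
    k = np.arange(n)[:, None]          # frequency index (rows)
    i = np.arange(n)[None, :]          # sample index (columns)
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0, :] /= np.sqrt(2.0)            # rescale the DC row for orthonormality
    return D

n = 16                                  # illustrative size, not from the paper
rng = np.random.default_rng(0)
D = dct_matrix(n)

# Sparse coefficient grid: only n of the n*n entries are nonzero.
C = np.zeros((n, n))
C[np.arange(n), np.arange(n)] = rng.uniform(1.0, 2.0, size=n)

# Separable inverse 2D DCT: one transform along rows, one along columns.
W = D.T @ C @ D
rank_sparse = np.linalg.matrix_rank(W)

# A rank-1 factorization a @ b already uses 2n parameters.
a = rng.normal(size=(n, 1))
b = rng.normal(size=(1, n))
rank_lowrank = np.linalg.matrix_rank(a @ b)

print(rank_sparse, rank_lowrank)
```

Here `n` stored coefficients give a full-rank matrix, while a factorized parameterization with twice as many parameters is limited to rank 1.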

Abstract

Linear layers hold most of a transformer's parameters. We replace each linear layer with one that stores $K$ out of $mn$ two-dimensional DCT coefficients per weight matrix and reconstructs the full matrix through an inverse DCT at every forward pass; the $K$ coefficients are the trainable parameters. A 4-layer, 128-dim transformer trained from scratch on character-level Shakespeare reaches validation loss $1.604$ at $K = mn/2$, against $1.580$ for a standard dense baseline: a gap of $+0.024$ at half the trainable parameter count, within the terminal-epoch variation of the dense run. A rank-48 LoRA factorization at the same trainable parameter count reaches only $1.801$ ($+0.221$). The structural advantage of sparse-coefficient over low-rank parameterizations at matched $K$ is qualitative. We identify rank flexibility as the mechanism. A random orthonormal basis matches the DCT…
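The layer described in the abstract can be sketched as follows. This is a minimal forward-pass illustration under stated assumptions, not the author's implementation: the class name `DCTLinear`, the initialization scale, and in particular the rule for choosing which $K$ coefficient positions are kept (here, the lowest-frequency ones) are hypothetical, since the excerpt does not specify them. Only the $K$ stored coefficients would be trained; the dense matrix is rebuilt by an inverse 2D DCT on every call.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D of shape (n, n), so that D @ D.T == I."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

class DCTLinear:
    """Linear map whose (m, n) weight is stored as K two-dimensional DCT coefficients."""

    def __init__(self, m, n, K, seed=0):
        self.shape = (m, n)
        self.Dm, self.Dn = dct_matrix(m), dct_matrix(n)
        # ASSUMPTION: keep the K lowest-frequency positions (smallest row + column index).
        freq = np.add.outer(np.arange(m), np.arange(n)).ravel()
        self.idx = np.argsort(freq, kind="stable")[:K]
        # The K trainable parameters (initialization scale is an assumption).
        rng = np.random.default_rng(seed)
        self.coeffs = rng.normal(0.0, 1.0 / np.sqrt(n), size=K)

    def weight(self):
        # Scatter the K coefficients into the full grid, then apply the inverse 2D DCT.
        m, n = self.shape
        C = np.zeros(m * n)
        C[self.idx] = self.coeffs
        return self.Dm.T @ C.reshape(m, n) @ self.Dn

    def __call__(self, x):
        # x has shape (batch, n); the output has shape (batch, m).
        return x @ self.weight().T

layer = DCTLinear(m=8, n=16, K=8 * 16 // 2)   # K = mn/2, as in the abstract
x = np.ones((4, 16))
y = layer(x)
print(y.shape, layer.coeffs.size)
```

Applying the forward 2D DCT to `layer.weight()` recovers the stored coefficients at the kept positions and zeros elsewhere, which is a convenient sanity check for the round trip.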
