Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency

Naoki Nishikawa; Rei Higuchi; Taiji Suzuki

arXiv:2507.03340·cs.LG·July 8, 2025

TL;DR

This paper introduces a principled approach to automatically determine the feature dimension in linear attention models using statistical degrees of freedom, improving approximation quality and model performance.

Contribution

The paper proposes selecting the feature dimension of each linear attention layer from its statistical degrees of freedom, supported by a theoretical bound on the approximation error and a layerwise training procedure for nonlinear features.
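To make the idea of degrees of freedom concrete, here is a minimal sketch in Python with NumPy. It assumes one common definition of statistical degrees of freedom from the kernel literature, tr(K (K + nλI)^{-1}) for a Gram matrix K over n inputs; the page does not state the paper's exact definition. The helper names, the toy inputs, and the proportional budget split in `allocate_dims` are all hypothetical illustrations, not the authors' procedure.

```python
import numpy as np

def exp_gram(x):
    """Gram matrix of the exponential kernel exp(<x_i, x_j> / sqrt(d))."""
    d = x.shape[1]
    return np.exp(x @ x.T / np.sqrt(d))

def degrees_of_freedom(gram, lam):
    """Effective dimension tr(K (K + n*lam*I)^{-1}) of a PSD Gram matrix K."""
    n = gram.shape[0]
    # Clip tiny negative eigenvalues caused by floating-point error.
    eigvals = np.clip(np.linalg.eigvalsh(gram), 0.0, None)
    return float(np.sum(eigvals / (eigvals + n * lam)))

def allocate_dims(dofs, total_budget):
    """Hypothetical rule: split a feature budget across layers in proportion to DoF."""
    dofs = np.asarray(dofs, dtype=float)
    raw = total_budget * dofs / dofs.sum()
    return np.maximum(1, np.floor(raw).astype(int))

rng = np.random.default_rng(0)
# Toy stand-ins for the inputs of two attention layers (assumed data, not from the paper).
layer_inputs = [
    rng.normal(size=(64, 2)) @ rng.normal(size=(2, 16)) / 4.0,  # inputs confined to a 2-D subspace
    rng.normal(size=(64, 16)),                                  # inputs spread over all 16 dimensions
]
dofs = [degrees_of_freedom(exp_gram(x), lam=1e-3) for x in layer_inputs]
print("degrees of freedom per layer:", dofs)
print("feature dimensions for a budget of 128:", allocate_dims(dofs, 128))
```

Under this assumed definition, a layer whose kernel spectrum decays quickly has few degrees of freedom and would receive few features, while a layer with a flatter spectrum would receive more.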

Findings

01 Improved performance of distilled models over baselines.

02 Smaller approximation error under a fixed computational budget.

03 Insights into attention complexity across layers.

Abstract

Linear attention has attracted interest as a computationally efficient approximation to softmax attention, especially for long sequences. Recent studies have explored distilling softmax attention in pre-trained Transformers into linear attention. However, a critical challenge remains: how to choose the feature dimension that governs the approximation quality. Existing methods fix this dimension uniformly across all attention layers, overlooking their diverse roles and complexities. In this paper, we propose a principled method to automatically determine the feature dimension in linear attention using the concept of statistical degrees of freedom, which represent the effective dimensionality of the inputs. We provide a theoretical bound on the approximation error and show that the dimension chosen by our method achieves smaller error under a fixed computational budget. Furthermore,…
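As an illustration of how the feature dimension governs approximation quality, the following sketch compares exact softmax attention with a linear attention approximation. It swaps in positive random features in the style of Performer, which is an assumption made for illustration; the paper distills learned features, and nothing here reproduces its method. The attention is non-causal, and all names, shapes, and inputs are hypothetical.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Exact softmax attention, costing O(n^2) in the sequence length n."""
    scores = q @ k.T
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return (weights / weights.sum(axis=1, keepdims=True)) @ v

def positive_random_features(x, proj):
    """phi(x) with E[phi(q) . phi(k)] = exp(q . k) when rows of proj are N(0, I)."""
    m = proj.shape[0]
    sq_norm = 0.5 * np.sum(x ** 2, axis=1, keepdims=True)
    return np.exp(x @ proj.T - sq_norm) / np.sqrt(m)

def linear_attention(q, k, v, proj):
    """Linear attention with feature dimension m = proj.shape[0], costing O(n * m)."""
    phi_q = positive_random_features(q, proj)  # (n, m)
    phi_k = positive_random_features(k, proj)  # (n, m)
    kv = phi_k.T @ v                           # (m, d_v), summed over keys once
    normalizer = phi_q @ phi_k.sum(axis=0)     # (n,)
    return (phi_q @ kv) / normalizer[:, None]

rng = np.random.default_rng(0)
n, d = 128, 16
# Toy queries, keys, and values (assumed data); q and k are scaled to keep exp() tame.
q = rng.normal(size=(n, d)) / np.sqrt(d)
k = rng.normal(size=(n, d)) / np.sqrt(d)
v = rng.normal(size=(n, d))

exact = softmax_attention(q, k, v)
for m in (8, 64, 512):
    proj = rng.normal(size=(m, d))
    approx = linear_attention(q, k, v, proj)
    print(f"feature dim {m}: mean abs error {np.abs(approx - exact).mean():.4f}")
```

The feature dimension `m` is the quantity the paper proposes to set per layer instead of uniformly; in this sketch it is simply swept over a few values to show its effect on the error.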

Videos

Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency · SlidesLive