TL;DR
This paper introduces a principled approach to automatically determine the feature dimension in linear attention models using statistical degrees of freedom, improving approximation quality and model performance.
Contribution
It proposes a method to select feature dimensions based on statistical degrees of freedom, with theoretical error bounds and layerwise training for nonlinear features.
Findings
Improved performance of distilled models over baselines.
Smaller approximation error under fixed computational budget.
Insights into attention complexity across layers.
Abstract
Linear attention has attracted interest as a computationally efficient approximation to softmax attention, especially for long sequences. Recent studies have explored distilling softmax attention in pre-trained Transformers into linear attention. However, a critical challenge remains: how to choose the feature dimension that governs the approximation quality. Existing methods fix this dimension uniformly across all attention layers, overlooking the layers' diverse roles and complexities. In this paper, we propose a principled method to automatically determine the feature dimension in linear attention using the concept of statistical degrees of freedom, which represent the effective dimensionality of the inputs. We provide a theoretical bound on the approximation error and show that the dimension chosen by our method achieves smaller error under a fixed computational budget. Furthermore,…
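To make "statistical degrees of freedom" concrete, here is a minimal NumPy sketch. It assumes the standard kernel ridge regression definition, tr(K (K + nλI)^{-1}), applied to an exponential (softmax-style) kernel over unit-norm inputs. The excerpt above does not give the paper's exact definition, so the formula, the kernel, the regularizer `lam`, and the helper names `degrees_of_freedom`, `exp_kernel`, and `unit_rows` are all illustrative assumptions rather than the authors' method.

```python
import numpy as np

def degrees_of_freedom(K, lam):
    """Effective dimension tr(K (K + n*lam*I)^{-1}) of a PSD kernel matrix K.

    Assumed definition (standard in kernel ridge regression); the paper's
    exact formulation may differ.
    """
    n = K.shape[0]
    # Eigenvalues of a symmetric matrix; clip tiny negatives from round-off.
    eig = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    return float(np.sum(eig / (eig + n * lam)))

def exp_kernel(X):
    """Exponential kernel exp(<x_i, x_j>), the unnormalized softmax score."""
    return np.exp(X @ X.T)

def unit_rows(X):
    """Scale each row to unit Euclidean norm so kernel values stay bounded."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 256, 16

# Hypothetical "simple" layer: inputs confined to a 2-dimensional subspace.
X_low = unit_rows(rng.normal(size=(n, 2)) @ rng.normal(size=(2, d)))
# Hypothetical "complex" layer: inputs spread over all d dimensions.
X_high = unit_rows(rng.normal(size=(n, d)))

for name, X in [("low-complexity inputs", X_low), ("high-complexity inputs", X_high)]:
    dof = degrees_of_freedom(exp_kernel(X), lam=1e-3)
    print(f"{name}: degrees of freedom = {dof:.2f}")
```

Under these assumptions, a layer whose inputs yield a larger value would be given a larger feature dimension, which is the intuition behind allocating dimensions per layer instead of uniformly.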