TrAct: Making First-layer Pre-Activations Trainable
Felix Petersen, Christian Borgelt, Stefano Ermon

TL;DR
This paper introduces TrAct, a method that makes first-layer pre-activations trainable by performing gradient descent on activations, leading to faster training of vision models with minimal overhead.
Contribution
The paper proposes a novel approach to optimize first-layer activations directly, providing a closed-form solution and demonstrating significant training speedups across various vision models.
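The contribution above mentions a closed-form solution. The summary does not reproduce it, so the following is only a minimal sketch under an assumption made here: that the weight change is chosen by ridge-regularized least squares, so that the first-layer activations move toward the target produced by a gradient step on those activations. The helper name `activation_fit_update` and the parameters `lr` and `lam` are hypothetical and not taken from the paper.

```python
import numpy as np

def activation_fit_update(x, grad_z, lr=0.1, lam=0.1):
    """Hypothetical helper (assumed formulation, not the paper's code).

    For a linear first layer z = x @ W.T, return the weight change dW that
    minimizes  ||x @ dW.T - (-lr * grad_z)||^2 / b + lam * ||dW||^2,
    i.e. the change whose effect on the activations best matches a
    gradient step taken directly on the activations.

    x:      (b, n) flattened inputs
    grad_z: (b, m) upstream gradient dL/dz
    """
    b, n = x.shape
    gram = x.T @ x / b + lam * np.eye(n)   # (n, n), symmetric positive definite
    rhs = -lr * (grad_z.T @ x) / b         # (m, n), a scaled plain gradient step
    # Solve dW @ gram = rhs without forming an explicit inverse.
    return np.linalg.solve(gram, rhs.T).T

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(8, 3))    # toy inputs
grad_z = rng.normal(size=(8, 4))           # upstream gradient, held fixed

# Change in activations caused by the update, at full and at half contrast.
# A tiny lam isolates the least-squares part of the objective.
x_low = 0.5 * x
dz_full = x @ activation_fit_update(x, grad_z, lam=1e-9).T
dz_low = x_low @ activation_fit_update(x_low, grad_z, lam=1e-9).T
```

In this sketch, with `lam` close to zero, `dz_full` and `dz_low` agree up to numerical error, so the resulting activation change does not depend on the input contrast. A plain gradient step on the weights would instead change the activations by an amount that scales with the square of the contrast factor.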
Findings
Training speed increased by 1.25x to 4x
Applicable to convolutional and transformer models
Minimal computational overhead
Abstract
We consider the training of the first layer of vision models and notice the clear relationship between pixel values and gradient update magnitudes: the gradients arriving at the weights of a first layer are by definition directly proportional to (normalized) input pixel values. Thus, an image with low contrast has a smaller impact on learning than an image with higher contrast, and a very bright or very dark image has a stronger impact on the weights than an image with moderate brightness. In this work, we propose performing gradient descent on the embeddings produced by the first layer of the model. However, switching to discrete inputs with an embedding layer is not a reasonable option for vision models. Thus, we propose the conceptual procedure of (i) a gradient descent step on first layer activations to construct an activation proposal, and (ii) finding the optimal weights of the…
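The proportionality between pixel values and first-layer weight gradients described in the abstract can be checked numerically. The sketch below assumes a linear first layer `z = x @ W.T` and holds the upstream gradient `grad_z` fixed for illustration (in a real network it would also change with the input). The function name and the toy shapes are hypothetical.

```python
import numpy as np

def first_layer_weight_grad(x, grad_z):
    """Gradient of the loss w.r.t. W for a linear layer z = x @ W.T.

    x:      (batch, n_inputs) flattened, normalized input pixels
    grad_z: (batch, n_outputs) upstream gradient dL/dz
    """
    return grad_z.T @ x

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(8, 12))   # toy "images"
grad_z = rng.normal(size=(8, 4))           # upstream gradient, held fixed

g_full = first_layer_weight_grad(x, grad_z)
g_low = first_layer_weight_grad(0.5 * x, grad_z)   # same images at half contrast

# Ratio of gradient magnitudes between the low-contrast and original inputs.
ratio = np.linalg.norm(g_low) / np.linalg.norm(g_full)
```

Because the weight gradient is linear in `x`, halving the contrast halves every entry of the gradient, so `ratio` is 0.5 up to floating-point error.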
Taxonomy
Topics: Scientific Computing and Data Management
