Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights
Yulu Gan, Phillip Isola

TL;DR
This paper proposes viewing pretrained models as distributions over parameters containing many task-specific experts, and demonstrates that large models have dense neighborhoods of such experts, enabling simple sampling-based methods to improve performance.
Contribution
The paper recasts the outcome of pretraining as a distribution over parameters in which task experts are dense, and shows that simple sampling methods can find task-specific solutions in large models.
Findings
Large models have a high density of task experts around pretrained weights.
Sampling and ensembling parameter perturbations can match standard post-training methods.
Simple parallel sampling approaches are competitive with complex optimization techniques.
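One way to make the "density of task experts" in the findings above concrete is to count how often a random perturbation of the weights beats the unperturbed weights on a task. The sketch below is a toy illustration only, not the paper's procedure: the linear classifier, the isotropic Gaussian perturbations, the noise scale `sigma`, and the helper names `accuracy` and `expert_density` are all assumptions made for this example.

```python
import numpy as np

def accuracy(w, X, y):
    """Accuracy of a linear classifier sign(X @ w) against labels in {-1, +1}."""
    return float(np.mean(np.sign(X @ w) == y))

def expert_density(w0, X, y, sigma=0.1, n_samples=1000, seed=0):
    """Fraction of Gaussian perturbations of w0 that strictly beat w0 on (X, y).

    Assumption: perturbations are isotropic Gaussian with scale sigma.
    """
    rng = np.random.default_rng(seed)
    base = accuracy(w0, X, y)
    noise = rng.normal(scale=sigma, size=(n_samples, w0.size))
    scores = np.array([accuracy(w0 + eps, X, y) for eps in noise])
    return float(np.mean(scores > base))

# Toy task: labels come from a hidden linear rule (hypothetical data).
rng = np.random.default_rng(1)
w_true = rng.normal(size=8)
X = rng.normal(size=(500, 8))
y = np.sign(X @ w_true)

# Stand-in for imperfect "pretrained" weights: the true rule plus noise.
w0 = w_true + 0.5 * rng.normal(size=8)
density = expert_density(w0, X, y)
print(f"fraction of perturbations that improve on w0: {density:.3f}")
```

In this toy setting the estimate depends heavily on `sigma` and on how far `w0` is from a good solution, so it says nothing about the scale at which the paper reports dense neighborhoods.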
Abstract
Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors, whose support already contains task-specific experts. We show that in small models such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models the density of task-experts increases dramatically, so that diverse, task-improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post-training method that samples parameter perturbations at random, selects the top K, and ensembles…
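The sample, select, and ensemble loop described in the abstract can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear model, the Gaussian perturbations, the values of `n_samples`, `top_k`, and `sigma`, and the majority-vote ensembling rule are all choices made here, since the abstract as shown is truncated before it says how the ensemble is formed.

```python
import numpy as np

def accuracy(w, X, y):
    """Accuracy of a linear classifier sign(X @ w) against labels in {-1, +1}."""
    return float(np.mean(np.sign(X @ w) == y))

def sample_and_select(w0, X_sel, y_sel, n_samples=256, top_k=7, sigma=0.2, seed=0):
    """Sample random perturbations of w0 and keep the top_k by selection accuracy.

    Each candidate is scored independently, so this loop could be run in parallel.
    """
    rng = np.random.default_rng(seed)
    candidates = w0 + rng.normal(scale=sigma, size=(n_samples, w0.size))
    scores = np.array([accuracy(w, X_sel, y_sel) for w in candidates])
    order = np.argsort(scores)[::-1][:top_k]  # best candidates first
    return candidates[order], scores[order]

def majority_vote(experts, X):
    """Ensemble by majority vote (assumed rule); an odd top_k avoids ties."""
    votes = np.sign(X @ experts.T)  # shape (n_examples, top_k)
    return np.sign(votes.sum(axis=1))

# Toy task with separate selection and test splits (hypothetical data).
rng = np.random.default_rng(1)
w_true = rng.normal(size=8)
X_sel = rng.normal(size=(300, 8))
y_sel = np.sign(X_sel @ w_true)
X_test = rng.normal(size=(300, 8))
y_test = np.sign(X_test @ w_true)

w0 = w_true + 0.5 * rng.normal(size=8)  # stand-in for pretrained weights
experts, sel_scores = sample_and_select(w0, X_sel, y_sel)
pred = majority_vote(experts, X_test)
ensemble_acc = float(np.mean(pred == y_test))
print(f"base: {accuracy(w0, X_test, y_test):.3f}  ensemble: {ensemble_acc:.3f}")
```

Selecting on one split and reporting on another keeps the comparison with the base weights honest, because the top candidates are chosen for their score on the selection data.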
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
