Embedding Recycling for Language Models
Jon Saad-Falcon, Amanpreet Singh, Luca Soldaini, Mike D'Arcy, Arman Cohan, Doug Downey

TL;DR
Embedding recycling (ER) leverages cached model activations to significantly speed up training and inference across multiple language models and tasks with minimal accuracy loss.
Contribution
This paper provides the first extensive evaluation of ER techniques across diverse models and tasks, demonstrating their practical effectiveness and potential for speed improvements.
Findings
Over 90% training speedup with minimal accuracy impact
Effective ER across models from 17M to 900M parameters
Identifies key areas for future research in ER methods
Abstract
Real-world applications of neural language models often involve running many different models over the same corpus. The high computational cost of these runs has led to interest in techniques that can reuse the contextualized embeddings produced in previous runs to speed training and inference of future ones. We refer to this approach as embedding recycling (ER). While multiple ER techniques have been proposed, their practical effectiveness is still unknown because existing evaluations consider very few models and do not adequately account for overhead costs. We perform an extensive evaluation of ER across eight different models (17 to 900 million parameters) and fourteen tasks in English. We show how a simple ER technique that caches activations from an intermediate layer of a pretrained model, and learns task-specific adapters on the later layers, is broadly effective. For the…
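The caching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `ToyEncoder`, `make_layer`, and `CACHE_LAYER` are hypothetical names, the "layers" are simple elementwise functions standing in for transformer layers, and the task-specific adapters and their training are omitted entirely. The sketch only shows that once activations from an intermediate layer are stored, a later run over the same corpus can start from the cache and execute only the remaining layers.

```python
import math


def make_layer(scale):
    """Return a toy layer (an elementwise nonlinearity); a stand-in, not a real transformer layer."""
    def layer(hidden):
        return [math.tanh(scale * h + 0.1) for h in hidden]
    return layer


class ToyEncoder:
    """Hypothetical frozen encoder that counts how many layers it executes."""

    def __init__(self, num_layers):
        self.layers = [make_layer(0.5 + 0.1 * i) for i in range(num_layers)]
        self.layer_calls = 0

    def forward(self, hidden, start=0, stop=None):
        # Run layers[start:stop] on the given hidden state.
        stop = len(self.layers) if stop is None else stop
        for layer in self.layers[start:stop]:
            hidden = layer(hidden)
            self.layer_calls += 1
        return hidden


CACHE_LAYER = 4  # assumption: which intermediate layer to cache is a free choice here
encoder = ToyEncoder(num_layers=6)
corpus = {"doc-1": [0.2, -0.4, 0.9], "doc-2": [1.0, 0.0, -0.3]}

# First run: full forward pass, storing the intermediate activations per document.
cache = {}
full_outputs = {}
for doc_id, inputs in corpus.items():
    mid = encoder.forward(inputs, stop=CACHE_LAYER)
    cache[doc_id] = mid
    full_outputs[doc_id] = encoder.forward(mid, start=CACHE_LAYER)
calls_first_run = encoder.layer_calls

# Later run over the same corpus: start from the cache, run only the later layers.
encoder.layer_calls = 0
recycled_outputs = {
    doc_id: encoder.forward(mid, start=CACHE_LAYER) for doc_id, mid in cache.items()
}
calls_recycled_run = encoder.layer_calls

print(calls_first_run, calls_recycled_run)
```

In this toy setting the recycled run executes 2 of 6 layers per document and reproduces the full-pass outputs exactly, because the earlier layers are frozen. Real savings depend on the cost of storing and loading the cache, which the abstract notes earlier evaluations did not adequately account for.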
