No Need to Talk: Asynchronous Mixture of Language Models
Anastasiia Filippova, Angelos Katharopoulos, David Grangier, Ronan Collobert

TL;DR
SMALLTALK LM presents an asynchronous training approach for mixture of language models, enabling efficient specialization and routing without high communication overhead, achieving lower perplexity and strong downstream performance.
Contribution
Introduces SMALLTALK LM, a novel asynchronous training method for language model mixtures with a lightweight routing scheme that improves efficiency and performance.
Findings
Achieves lower perplexity than dense models at similar FLOPs
Uses significantly fewer parameters during inference
Outperforms dense baselines on 75% of downstream tasks
Abstract
We introduce SMALLTALK LM, an innovative method for training a mixture of language models in an almost asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses a fraction of the parameters from the overall mixture model. Unlike prior works on asynchronous LLM training, our routing method does not rely on full corpus clustering or access to metadata, making it more suitable for real-world applications. Our experiments on language modeling demonstrate that SMALLTALK LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs and an almost identical inference cost.…
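The prefix-based routing described in the abstract can be illustrated with a toy sketch. The abstract does not say how the router scores a prefix, so this example assumes each expert has a small scoring model and the sequence goes to the expert whose scorer gives the prefix the highest log-likelihood. The unigram scorer and the names `fit_unigram_router`, `prefix_log_likelihood`, and `route` are hypothetical stand-ins for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def fit_unigram_router(token_ids, vocab_size, alpha=1.0):
    """Toy stand-in for a small router model (assumption, not the paper's router):
    add-alpha smoothed unigram log-probabilities over the vocabulary."""
    counts = Counter(token_ids)
    total = len(token_ids) + alpha * vocab_size
    return [math.log((counts.get(t, 0) + alpha) / total) for t in range(vocab_size)]


def prefix_log_likelihood(router, prefix):
    """Sum of per-token log-probabilities of the prefix under one router."""
    return sum(router[t] for t in prefix)


def route(sequence, routers, prefix_len=256):
    """Return the index of the single expert chosen for this sequence.
    Only the first `prefix_len` tokens are scored (assumed argmax rule)."""
    prefix = sequence[:prefix_len]
    scores = [prefix_log_likelihood(r, prefix) for r in routers]
    return max(range(len(routers)), key=scores.__getitem__)


# Two toy "experts", each with a router fit on a different slice of data.
routers = [
    fit_unigram_router([0, 0, 0, 1], vocab_size=4),
    fit_unigram_router([2, 2, 2, 3], vocab_size=4),
]

sequence = [2, 3, 2, 0, 0, 0, 0, 0]
short_prefix_choice = route(sequence, routers, prefix_len=3)  # scores [2, 3, 2]
full_prefix_choice = route(sequence, routers, prefix_len=8)   # scores all 8 tokens
```

In this toy setup the chosen expert depends on how much of the sequence the router sees: the three-token prefix favors the second expert, while scoring the whole sequence favors the first. Only the selected expert would then run on the sequence, which is why inference touches a fraction of the mixture's parameters.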
Peer Reviews
Decision · ICLR 2025 Spotlight
The method is simple (to its credit), intuitive, and effective, and it attacks an important problem. While information about the largest and most significant language models is, of course, sparse, they do not appear to be growing at the same rate as they have in previous years, and significant attention is now being devoted to the problem of improving the performance of comparatively small models (e.g. 2B models like Gemma). Good algorithms for asynchronous training could potentially unlock prev
I'm not in love with the choice to benchmark the asynchronous MoE against a dense model with the same number of parameters as each expert but trained for as many tokens as all experts combined (see the para. starting "Comparison to the Dense Model..."). Depending on how the number of training tokens was chosen, it might unfairly bias the results towards the expert setup---training one really saturated model up to 32x longer than each expert might just be a waste of compute and not a realistic co
- The authors show that the same perplexity can be achieved at one third of the compute cost of the dense baseline.
- The final performance (test PPL) is not sensitive to the router size, so the approach succeeds even with tiny routers (4M parameters).
- The authors have conducted experiments that are highly expensive in nature. The community will benefit from such a study. In particular, the test PPL benefit (line 349) at one third of the compute is a remarkable result, it c
Performance (test PPL) is sensitive to the prefix length (256 tokens used to train routers) and the sensitivity increases with the number of experts.
- Unlike approaches like MoE, the experts here are independent, and therefore, the routing can be done before each training epoch. Since each system can be trained independently, this reduces the simultaneous RAM requirements, which, unlike MoE approaches, enables SmalltalkLM systems to be trained using the same computational hardware. Inference also has lower RAM requirements.
- They demonstrate that this approach results in better performance across different model sizes and can yield improv
- As far as I understand, this work takes ideas from existing deep learning MoE but simplifies the process: instead of using the top-k experts, operating at the token level, or having experts per layer, it has a single router that sends each sequence to a single expert, which may be of limited novelty.
- The savings in RAM might not be too substantial, as when deployed in live systems, you’ll need to load up all experts simultaneously anyway (to avoid the massive latency of continually reloading the next expert). Since
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
