Slamming: Training a Speech Language Model on One GPU in a Day
Gallil Maimon, Avishai Elmakies, Yossi Adi

TL;DR
This paper presents Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU within 24 hours, with the aim of making SLM training and research more accessible.
Contribution
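It introduces a training recipe for SLMs, covering model initialisation and architecture, synthetic training data, and preference optimisation with synthetic data, that reaches results on par with leading SLMs at a fraction of the compute cost and far outperforms the compute-optimal performance predicted by SLM scaling laws.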
Findings
Slam trains high-quality SLMs on a single GPU in 24 hours.
The recipe scales well with more compute, reaching results on par with leading SLMs at a fraction of the compute cost.
Results far outperform the compute-optimal performance predicted by SLM scaling laws, giving an optimistic view of SLM feasibility.
Abstract
We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data, and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute, getting results on par with leading SLMs at a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform predicted compute-optimal performance, giving an optimistic view of SLM feasibility. See code, data, models, and samples at https://pages.cs.huji.ac.il/adiyoss-lab/slamming.
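The abstract mentions preference optimisation with synthetic data but does not name the objective on this page. As a minimal sketch, assuming a DPO-style (Direct Preference Optimisation) loss, which is one common choice and not confirmed here as the authors' method, the per-pair objective could look like the following. The function names, the `beta` value, and the example log-probabilities are illustrative assumptions only.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """DPO-style loss for one (chosen, rejected) pair of continuations.

    Each argument is the summed log-probability of a whole continuation
    (for an SLM, a sequence of speech tokens) under the trained policy
    or under a frozen reference model.
    """
    # How much more the policy likes each continuation than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy favours the chosen continuation more.
    return -log_sigmoid(beta * (chosen_gain - rejected_gain))


# Hypothetical log-probabilities for a single preference pair.
loss = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

When the policy and the reference agree exactly, the loss equals log 2; it drops below that as the policy shifts probability towards the preferred continuation.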
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
