Modalities, a PyTorch-native Framework For Large-scale LLM Training and Research
Max Lübbering, Timm Ruland, Richard Rutmann, Felix Stollenwerk, David Fitzek, Michael Fromm, Alexander Weber, Rafet Sifa, Nicolas Flores-Herr, Joachim Köhler, Mehdi Ali

TL;DR
Modalities is a comprehensive PyTorch framework designed to facilitate large-scale LLM training and research, enabling efficient ablations and reproducible experiments at trillion-token and billion-parameter scales.
Contribution
The paper introduces Modalities, a modular, declarative framework that integrates advanced parallelization strategies for large-scale LLM training and systematic ablation studies.
Findings
Supports efficient pretraining at trillion-token scale
Enables systematic ablation studies with modular design
Improves reproducibility and extensibility of LLM research
Abstract
Today's LLM (pre-)training and research workflows typically allocate a significant amount of compute to large-scale ablation studies. Despite the substantial compute costs of these ablations, existing open-source frameworks provide limited tooling for these experiments, often forcing researchers to write their own wrappers and scripts. We propose Modalities, an end-to-end PyTorch-native framework that integrates data-driven LLM research with large-scale model training from two angles. Firstly, by integrating state-of-the-art parallelization strategies, it enables both efficient pretraining and systematic ablations at trillion-token and billion-parameter scale. Secondly, Modalities adopts a modular design with declarative, self-contained configuration, enabling levels of reproducibility and extensibility that are difficult to achieve out of the box with existing LLM training frameworks.
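To make the idea of declarative, self-contained configuration more concrete, the following is a minimal sketch in plain Python. It is not the Modalities API: the registry, the `component` and `variant` keys, the `build` and `ablation_grid` helpers, and the `Model` and `Optimizer` stand-ins are all hypothetical names chosen for illustration. The sketch only shows the general pattern the abstract describes, in which a single config fully specifies a run and an ablation is expressed as a set of overrides on that config.

```python
import copy
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical registry mapping (component, variant) to a builder function.
REGISTRY: Dict[Tuple[str, str], Callable[..., Any]] = {}


def register(component: str, variant: str):
    """Decorator that records a builder under a (component, variant) key."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        REGISTRY[(component, variant)] = fn
        return fn
    return decorator


@dataclass
class Model:  # stand-in for a real model class
    n_layer: int
    n_embd: int


@dataclass
class Optimizer:  # stand-in for a real optimizer class
    name: str
    lr: float


@register("model", "gpt2")
def build_gpt2(n_layer: int, n_embd: int) -> Model:
    return Model(n_layer=n_layer, n_embd=n_embd)


@register("optimizer", "adamw")
def build_adamw(lr: float) -> Optimizer:
    return Optimizer(name="adamw", lr=lr)


def build(config: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Instantiate every component named in a self-contained config."""
    built = {}
    for name, spec in config.items():
        builder = REGISTRY[(spec["component"], spec["variant"])]
        built[name] = builder(**spec["config"])
    return built


def ablation_grid(base: Dict[str, Any],
                  overrides: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Expand dotted-path overrides into one full config per combination."""
    paths = list(overrides)
    configs = []
    for values in product(*(overrides[p] for p in paths)):
        cfg = copy.deepcopy(base)  # never mutate the base config
        for path, value in zip(paths, values):
            *parents, leaf = path.split(".")
            node = cfg
            for key in parents:
                node = node[key]
            node[leaf] = value
        configs.append(cfg)
    return configs


BASE_CONFIG = {
    "model": {"component": "model", "variant": "gpt2",
              "config": {"n_layer": 12, "n_embd": 768}},
    "optimizer": {"component": "optimizer", "variant": "adamw",
                  "config": {"lr": 3e-4}},
}

# Two learning rates x two depths = four fully specified runs.
GRID = ablation_grid(BASE_CONFIG, {
    "optimizer.config.lr": [1e-4, 3e-4],
    "model.config.n_layer": [12, 24],
})
```

Because each entry in `GRID` is a complete config rather than a diff, any single run can be stored and rebuilt on its own with `build(GRID[i])`, which is the reproducibility property this pattern aims for.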
Taxonomy
Topics: Scientific Computing and Data Management · Natural Language Processing Techniques · Topic Modeling
