Mixtera: A Data Plane for Foundation Model Training
Maximilian Böther, Xiaozhe Yao, Tolga Kerimoglu, Dan Graur, Viktor Gsteiger, Ana Klimovic

TL;DR
Mixtera is a flexible data management system for foundation model training that allows declarative specification and dynamic adjustment of data mixtures, improving training efficiency and effectiveness.
Contribution
We introduce Mixtera, a novel data plane that enables declarative, scalable, and dynamic data mixture management for large-scale foundation model training.
Findings
Mixtera scales to 256 GH200 superchips without bottlenecks.
Implementing the Adaptive Data Optimization (ADO) algorithm in Mixtera improves data mixture effectiveness.
Mixtera supports mixing strategies for vision-language models.
Abstract
State-of-the-art large language and vision models are trained over trillions of tokens that are aggregated from a large variety of sources. As training data collections grow, manually managing the samples becomes time-consuming, tedious, and prone to errors. Yet recent research shows that the data mixture and the order in which samples are visited during training can significantly influence model accuracy. We build and present Mixtera, a data plane for foundation model training that enables users to declaratively express which data samples should be used in which proportion and in which order during training. Mixtera is a centralized, read-only layer that is deployed on top of existing training data collections and can be declaratively queried. It operates independently of the filesystem structure and supports mixtures across arbitrary properties (e.g., language, source dataset) as well…
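To make the idea of a declaratively specified mixture concrete, here is a minimal sketch in Python. It is not Mixtera's API: the function `build_chunk`, the sample dictionaries, and the `language` property are hypothetical names chosen for illustration. The sketch assumes each sample carries metadata and that a mixture is a mapping from property values to proportions; it then assembles a fixed-size chunk that honors those proportions regardless of how the samples are laid out in storage.

```python
from collections import defaultdict
from typing import Dict, List


def build_chunk(samples: List[dict], key: str,
                mixture: Dict[str, float], chunk_size: int) -> List[dict]:
    """Pick chunk_size samples so each value of `key` appears in its mixture share.

    Hypothetical helper for illustration only; not part of Mixtera.
    """
    if abs(sum(mixture.values()) - 1.0) > 1e-9:
        raise ValueError("mixture proportions must sum to 1")

    # Group samples by the metadata property, independent of storage layout.
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)

    # Largest-remainder rounding so per-group counts sum exactly to chunk_size.
    exact = {value: share * chunk_size for value, share in mixture.items()}
    counts = {value: int(quota) for value, quota in exact.items()}
    leftover = chunk_size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda v: exact[v] - counts[v], reverse=True)
    for value in by_remainder[:leftover]:
        counts[value] += 1

    # Take the first `count` samples of each group, in mixture order.
    chunk = []
    for value, count in counts.items():
        if count > len(groups[value]):
            raise ValueError(f"not enough samples with {key}={value!r}")
        chunk.extend(groups[value][:count])
    return chunk


# Usage: a toy collection with 10 samples per language.
samples = [{"language": lang, "id": i}
           for lang in ("en", "de", "fr") for i in range(10)]
chunk = build_chunk(samples, "language",
                    {"en": 0.5, "de": 0.25, "fr": 0.25}, chunk_size=8)
```

Adjusting the mixture during training, as the abstract describes, would amount to calling such a function with a different `mixture` for later chunks. A real system would also need to track which samples were already consumed, which this sketch omits.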
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
