Matryoshka Model Learning for Improved Elastic Student Models

Chetan Verma; Aditya Srinivas Timmaraju; Cho-Jui Hsieh; Suyash Damle; Ngot Bui; Yang Zhang; Wen Chen; Xin Liu; Prateek Jain; Inderjit S Dhillon

arXiv:2505.23337·cs.LG·December 3, 2025

Open Access

TL;DR

MatTA is a framework that efficiently trains multiple accurate student models from a single teacher-TA-student setup, enabling better trade-offs between accuracy and serving costs in production ML systems.

Contribution

The paper introduces a novel Teacher-TA-Student training recipe that produces multiple high-quality student models from one training run, enhancing model deployment flexibility.

Findings

01. Achieved a 20% improvement in a key metric in live A/B tests.

02. Demonstrated over 24% relative improvement on SAT Math with GPT-2 Medium.

03. Showed over 10% improvement on the LAMBADA benchmark.

Abstract

Industry-grade ML models are carefully designed to meet rapidly evolving serving constraints, which requires significant resources for model development. In this paper, we propose MatTA, a framework for training multiple accurate Student models using a novel Teacher-TA-Student recipe. TA models are larger versions of the Student models with higher capacity, and thus allow Student models to better relate to the Teacher model and also bring in more domain-specific expertise. Furthermore, multiple accurate Student models can be extracted from the TA model. Therefore, despite only one training run, our methodology provides multiple servable options to trade off accuracy for lower serving cost. We demonstrate the proposed method, MatTA, on proprietary datasets and models. Its practical efficacy is underscored by live A/B tests within a production ML system, demonstrating 20% improvement on a…
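
The abstract does not spell out how Student models are extracted from the TA or which loss is used, so the following is only a minimal sketch under stated assumptions. It assumes Matryoshka-style nesting (the Student reuses the first `student_width` hidden units of the TA's feed-forward block) and a simple sum of two KL distillation terms against the Teacher's output distribution. All function names, shapes, and the loss form are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Mean KL(p || q) over a batch of probability rows."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def ffn_logits(x, w_in, w_out, width):
    """Feed-forward block that uses only the first `width` hidden units."""
    hidden = np.maximum(x @ w_in[:, :width], 0.0)  # ReLU activation
    return hidden @ w_out[:width, :]

def joint_distillation_loss(x, teacher_probs, w_in, w_out, student_width):
    """Assumed objective: distill the Teacher into the full-width TA and
    into the nested Student that shares the TA's leading hidden units."""
    ta_width = w_in.shape[1]
    ta_probs = softmax(ffn_logits(x, w_in, w_out, ta_width))
    student_probs = softmax(ffn_logits(x, w_in, w_out, student_width))
    return kl_divergence(teacher_probs, ta_probs) + kl_divergence(teacher_probs, student_probs)

def extract_student(w_in, w_out, student_width):
    """Slice a standalone Student out of the trained TA weights (assumed nesting)."""
    return w_in[:, :student_width].copy(), w_out[:student_width, :].copy()
```

Under this assumed nesting, several values of `student_width` could be sliced out of the same trained TA weights, which is one way a single training run might yield multiple servable models at different cost points.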

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Online Learning and Analytics · Machine Learning and Data Classification

Methods: Attention Is All You Need · Linear Layer · Weight Decay · Cosine Annealing · Multi-Head Attention · Byte Pair Encoding · Dropout · Residual Connection · Layer Normalization