Multimodal Dataset Distillation via Phased Teacher Models
Shengbin Guo, Hang Zhao, Senqiao Yang, Chenyang Jiang, Yuhang Cheng, Xiangru Peng, Rui Shao, Zhuotao Tian

TL;DR
This paper introduces PTM-ST, a phased distillation framework that improves multimodal dataset distillation by modeling teacher learning dynamics across training phases, leading to better performance and stability.
Contribution
The paper proposes a novel phased teacher model with shortcut trajectory to better capture teacher dynamics and enhance dataset distillation quality.
Findings
Significantly reduces optimization oscillations.
Mitigates inter-phase knowledge gaps.
Achieves up to 13.5% improvement on Flickr30K.
Abstract
Multimodal dataset distillation aims to construct compact synthetic datasets that enable efficient compression and knowledge transfer from large-scale image-text data. However, existing approaches often fail to capture the complex, dynamically evolving knowledge embedded in the later training stages of teacher models. This limitation leads to degraded student performance and compromises the quality of the distilled data. To address critical challenges such as pronounced cross-stage performance gaps and unstable teacher trajectories, we propose Phased Teacher Model with Shortcut Trajectory (PTM-ST) -- a novel phased distillation framework. PTM-ST leverages stage-aware teacher modeling and a shortcut-based trajectory construction strategy to accurately fit the teacher's learning dynamics across distinct training phases. This enhances both the stability and expressiveness of the…
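As a rough illustration of the phased idea described in the abstract, the sketch below builds one "shortcut" trajectory per training phase and scores a student against it. It is not the authors' implementation. It assumes that a shortcut is a plain linear interpolation between two teacher checkpoints, that each phase has hand-picked endpoints, and that the student is scored with the normalized parameter-distance loss common in trajectory-matching distillation. The names `shortcut_trajectory`, `matching_loss`, `checkpoints`, and `phases`, and all numeric values, are hypothetical.

```python
import numpy as np


def shortcut_trajectory(theta_start, theta_end, num_steps):
    """Return num_steps + 1 points on the straight line between two checkpoints.

    Assumption: the 'shortcut' is a plain linear interpolation.
    """
    alphas = np.linspace(0.0, 1.0, num_steps + 1)
    return [(1.0 - a) * theta_start + a * theta_end for a in alphas]


def matching_loss(theta_student, theta_target, theta_start):
    """Squared distance to the target, normalized by the start-to-target distance."""
    num = np.sum((theta_student - theta_target) ** 2)
    den = np.sum((theta_start - theta_target) ** 2)
    return float(num / den)


# Toy teacher checkpoints: one flat parameter vector per epoch (made-up values).
rng = np.random.default_rng(0)
checkpoints = {epoch: rng.normal(size=4) for epoch in (0, 2, 5)}

# Hypothetical phase schedule: interpolation endpoints chosen by hand per phase.
phases = [
    {"start_epoch": 0, "end_epoch": 2, "num_steps": 4},
    {"start_epoch": 2, "end_epoch": 5, "num_steps": 6},
]

# One shortcut trajectory per phase; consecutive phases share an endpoint.
phase_trajectories = [
    shortcut_trajectory(
        checkpoints[p["start_epoch"]], checkpoints[p["end_epoch"]], p["num_steps"]
    )
    for p in phases
]

# A student that has not moved from the start point has loss 1.0 by construction;
# a student that reaches the target has loss 0.0.
start, target = phase_trajectories[0][0], phase_trajectories[0][2]
loss_at_start = matching_loss(start, target, start)
loss_at_target = matching_loss(target, target, start)
```

In a real distillation loop the student parameters would come from a few training steps on the synthetic data, and the loss would be backpropagated into that data. Both steps are omitted here to keep the sketch self-contained.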
Peer Reviews
Decision·ICLR 2026 Poster
- Novelty. The paper identifies the "phased knowledge gap" as a critical issue in multimodal dataset distillation, where student performance degrades when using teacher models from later training stages. The proposed PTM-ST framework is a novel and effective solution that addresses this problem through phased learning and trajectory stabilization.
- Strong empirical performance. The experimental results on Flickr30K and COCO datasets show that PTM-ST consistently and significantly outperforms e…
- The PTM-ST framework introduces additional complexity, requiring manual specification of interpolation endpoints and matching ranges for each distillation stage. This may make the method difficult to apply to new datasets or tasks and raises concerns about hyperparameter sensitivity.
- Limited Scope of Evaluation: The experiments are limited to image-text retrieval. It would be beneficial to evaluate the generalizability of PTM-ST on other multimodal tasks, such as Visual Question Answering (…
- The method demonstrates promising empirical results, achieving strong and consistent improvements over the compared baselines.
- The approach of using sets of teacher parameters (checkpoints) effectively provides the student model with a global perspective on the teacher's training dynamics.
- The method requires access to intermediate teacher checkpoints. For very large models (the bigger CLIP variants), these are often unavailable, and replicating the teacher training to generate them introduces significant computational overhead.
- The student model's architecture and specific training details are missing.
* Strong motivation. The authors explore phased knowledge gaps in multimodal dataset distillation by investigating how distillation dynamics differ between multimodal and unimodal data distillation techniques, offering valuable insight into modality-specific learning behavior.
* Proposes a simple yet effective phased distillation strategy (PTM-ST), which divides the learning process into semantic phases and aligns them with the teacher's learning dynamics for trajectory endpoint…
- **Limited data scale and generalizability evidence:** The experiments are conducted on small datasets (e.g., Flickr30K, COCO) with limited computational resources, which may not capture the knowledge-gap behavior at large scale. As a result, the observed early-phase effectiveness and multi-stage training strategy may not generalize to full-scale CLIP-style pretraining.
- **Additional training cost and model complexity:** The method requires additional training and increases model complexity. Although the method aim…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
