Multimodal Dataset Distillation via Phased Teacher Models

Shengbin Guo; Hang Zhao; Senqiao Yang; Chenyang Jiang; Yuhang Cheng; Xiangru Peng; Rui Shao; Zhuotao Tian

arXiv:2603.25388·cs.CV·March 27, 2026



TL;DR

This paper introduces PTM-ST, a phased distillation framework that improves multimodal dataset distillation by modeling teacher learning dynamics across training phases, leading to better performance and stability.

Contribution

The paper proposes a novel phased teacher model with shortcut trajectory to better capture teacher dynamics and enhance dataset distillation quality.

Findings

01. Significantly reduces optimization oscillations.
02. Mitigates inter-phase knowledge gaps.
03. Achieves up to 13.5% improvement on Flickr30k.

Abstract

Multimodal dataset distillation aims to construct compact synthetic datasets that enable efficient compression and knowledge transfer from large-scale image-text data. However, existing approaches often fail to capture the complex, dynamically evolving knowledge embedded in the later training stages of teacher models. This limitation leads to degraded student performance and compromises the quality of the distilled data. To address critical challenges such as pronounced cross-stage performance gaps and unstable teacher trajectories, we propose Phased Teacher Model with Shortcut Trajectory (PTM-ST) -- a novel phased distillation framework. PTM-ST leverages stage-aware teacher modeling and a shortcut-based trajectory construction strategy to accurately fit the teacher's learning dynamics across distinct training phases. This enhances both the stability and expressiveness of the…
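The abstract does not say how the shortcut trajectory is built or which matching objective is used. As a rough illustration only, the sketch below assumes that a shortcut trajectory is a straight-line interpolation between the teacher checkpoints at the start and end of one phase, and that the student is scored with the normalized parameter-matching loss familiar from trajectory-matching distillation. The function names `shortcut_trajectory` and `matching_loss`, the flat-list parameter representation, and the toy checkpoint values are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def shortcut_trajectory(start: Sequence[float], end: Sequence[float],
                        num_steps: int) -> List[Vector]:
    """Return num_steps + 1 points on the straight line from start to end.

    Assumption: the "shortcut" replaces a noisy teacher trajectory inside
    one phase with a linear path between the phase's two endpoints.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if len(start) != len(end):
        raise ValueError("checkpoints must have the same number of parameters")
    return [
        [s + (e - s) * k / num_steps for s, e in zip(start, end)]
        for k in range(num_steps + 1)
    ]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two flat parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def matching_loss(student: Sequence[float], teacher_start: Sequence[float],
                  teacher_target: Sequence[float]) -> float:
    """Normalized matching loss: ||student - target||^2 / ||start - target||^2.

    A value of 1.0 means the student has not moved from the start point;
    0.0 means it reached the target exactly.
    """
    denom = squared_distance(teacher_start, teacher_target)
    if denom == 0.0:
        raise ValueError("start and target checkpoints are identical")
    return squared_distance(student, teacher_target) / denom


# Toy two-parameter "checkpoints" for a single phase (hypothetical values).
phase_start = [0.0, 0.0]
phase_end = [1.0, 2.0]
trajectory = shortcut_trajectory(phase_start, phase_end, num_steps=4)

# Match a segment of the shortcut: start at point 0, target point 2.
student_params = [0.25, 0.5]  # pretend result of training on synthetic data
loss = matching_loss(student_params, trajectory[0], trajectory[2])
```

In a phased setup, one such trajectory would be built per phase from that phase's endpoint checkpoints, and the synthetic data would be updated to reduce the loss over segments sampled within the phase's matching range. How PTM-ST actually selects endpoints and ranges is not recoverable from the text here.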

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- Novelty. The paper identifies the "phased knowledge gap" as a critical issue in multimodal dataset distillation, where student performance degrades when using teacher models from later training stages. The proposed PTM-ST framework is a novel and effective solution that addresses this problem through phased learning and trajectory stabilization.
- Strong empirical performance. The experimental results on Flickr30K and COCO datasets show that PTM-ST consistently and significantly outperforms e

Weaknesses

- The PTM-ST framework introduces additional complexity, requiring manual specification of interpolation endpoints and matching ranges for each distillation stage. This may make the method difficult to apply to new datasets or tasks and raises concerns about hyperparameter sensitivity.
- Limited Scope of Evaluation: The experiments are limited to image-text retrieval. It would be beneficial to evaluate the generalizability of PTM-ST on other multimodal tasks, such as Visual Question Answering (

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The method demonstrates promising empirical results, achieving strong and consistent improvements over the compared baselines.
- The approach of using sets of teacher parameters (checkpoints) effectively provides the student model with a global perspective on the teacher's training dynamics.

Weaknesses

- The method requires access to intermediate teacher checkpoints. For very large models (the bigger CLIP variants), these are often unavailable, and replicating the teacher training to generate them introduces significant computational overhead.
- The student model's architecture and specific training details are missing.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

* Strong motivation. The authors explore phased knowledge gaps in multimodal dataset distillation by investigating how distillation dynamics differ between multimodal and unimodal data distillation techniques, offering valuable findings on modality-specific learning behavior.
* Proposes a simple yet effective phased distillation strategy (PTM-ST), which divides the learning process into semantic phases and aligns them with the teacher’s learning dynamics for trajectory endpoint

Weaknesses

- **Limited data scale and generalizability evidence:** The experiments are conducted on small datasets (e.g., Flickr30K, COCO) with limited computational resources, which may not capture the knowledge-gap behavior at large scale. As a result, the observed early-phase effectiveness and multi-stage training strategy may not generalize to full-scale CLIP-style pretraining.
- **Additional training cost and model complexity:** The method requires more training and adds model complexity. Although the method aim

Code & Models

Datasets

Previsior22/PTM-ST
dataset· 110 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis