PRISM: Diversifying Dataset Distillation by Decoupling Architectural Priors
Brian B. Moser, Shalini Sarode, Federico Raue, Stanislav Frolov, Krzysztof Adamkiewicz, Arundhati Shanbhag, Joachim Folz, Tobias C. Nauen, Andreas Dengel

TL;DR
PRISM introduces a novel dataset distillation framework that leverages diverse teacher models to enhance intra-class diversity and generalization in synthetic data, outperforming existing methods on ImageNet-1K.
Contribution
PRISM decouples architectural priors during dataset synthesis by supervising logits and BN alignment with different teacher models, improving diversity and performance.
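The decoupling described above can be illustrated with a minimal sketch, assuming a softmax cross-entropy term scored by one primary teacher and a BN-statistic term averaged over a randomly sampled subset of the other teachers. Everything named here (`decoupled_loss`, `make_toy_teacher`, the weight `alpha`, the subset size `k`, a single BN layer per teacher, and excluding the primary teacher from the BN subset) is a hypothetical choice for illustration, not the authors' implementation.

```python
import math
import random


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def bn_alignment(batch_feats, running_mean, running_var):
    """Squared distance between batch statistics and stored BN statistics."""
    n = len(batch_feats)
    dim = len(running_mean)
    mean = [sum(f[d] for f in batch_feats) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in batch_feats) / n for d in range(dim)]
    return (sum((a - b) ** 2 for a, b in zip(mean, running_mean))
            + sum((a - b) ** 2 for a, b in zip(var, running_var)))


def decoupled_loss(batch, labels, teachers, primary, k, alpha, rng):
    """Logit term from the primary teacher, BN term from k other teachers."""
    # Logit matching: only the primary teacher scores the synthetic batch.
    ce = sum(cross_entropy(teachers[primary]["logits"](x), y)
             for x, y in zip(batch, labels)) / len(batch)
    # Regularization: sample a random subset of the remaining teachers
    # (assumption: the primary teacher is excluded from this subset).
    others = [name for name in teachers if name != primary]
    chosen = rng.sample(others, k)
    bn = sum(bn_alignment([teachers[name]["features"](x) for x in batch],
                          teachers[name]["bn_mean"], teachers[name]["bn_var"])
             for name in chosen) / k
    return ce + alpha * bn, chosen


def make_toy_teacher(scale):
    """Hypothetical stand-in for a pretrained network with one BN layer."""
    return {
        "logits": lambda x: [scale * v for v in x],
        "features": lambda x: [scale * v for v in x],
        "bn_mean": [0.0, 0.0],
        "bn_var": [1.0, 1.0],
    }


teachers = {
    "primary": make_toy_teacher(1.0),
    "aux_a": make_toy_teacher(0.5),
    "aux_b": make_toy_teacher(2.0),
}
batch = [[1.0, -1.0], [-1.0, 1.0]]   # two synthetic "images" with 2 features
labels = [0, 1]
loss, chosen = decoupled_loss(batch, labels, teachers, "primary",
                              k=1, alpha=0.01, rng=random.Random(0))
```

In a real pipeline the synthetic batch would be updated by gradient descent on this loss, with a fresh teacher subset drawn at each step; the sketch only shows how the two terms are routed to different teachers.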
Findings
Outperforms single-teacher (e.g., SRe2L) and multi-teacher (e.g., G-VBSM) methods on ImageNet-1K at low- and mid-IPC regimes.
Generated data exhibits significantly richer intra-class diversity.
Scalable cross-class batch formation enables fast parallel synthesis.
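Intra-class diversity is often quantified as the mean pairwise cosine similarity between feature vectors of samples that share a label, where lower similarity indicates more diverse samples. The following is a minimal sketch of that metric, assuming feature vectors have already been extracted by some encoder; the function names are hypothetical and this is not necessarily the exact protocol used in the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_intra_class_cosine(features, labels):
    """Mean pairwise cosine similarity per class (lower = more diverse)."""
    by_class = {}
    for feat, label in zip(features, labels):
        by_class.setdefault(label, []).append(feat)
    result = {}
    for label, feats in by_class.items():
        pairs = list(combinations(feats, 2))
        if pairs:  # classes with a single sample have no pairs to compare
            result[label] = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return result


# Toy features: class 0 holds two identical vectors, class 1 two orthogonal ones.
features = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
labels = [0, 0, 1, 1]
similarity = mean_intra_class_cosine(features, labels)
```

Here class 0 scores a similarity of 1 (no diversity) and class 1 scores 0, so a distilled set with richer intra-class diversity would sit closer to the second case.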
Abstract
Dataset distillation (DD) promises compact yet faithful synthetic data, but existing approaches often inherit the inductive bias of a single teacher model. As dataset size increases, this bias drives generation toward overly smooth, homogeneous samples, reducing intra-class diversity and limiting generalization. We present PRISM (PRIors from diverse Source Models), a framework that disentangles architectural priors during synthesis. PRISM decouples the logit-matching and regularization objectives, supervising them with different teacher architectures: a primary model for logits and a stochastic subset for batch-normalization (BN) alignment. On ImageNet-1K, PRISM consistently and reproducibly outperforms single-teacher methods (e.g., SRe2L) and recent multi-teacher variants (e.g., G-VBSM) at low- and mid-IPC regimes. The generated data also show significantly richer intra-class…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Pretty intuitive motivation, and presented clearly
- Strong empirical results
- Thorough ablation studies
- Relatively incremental contribution. The use of multiple architecture backbones has been studied in much prior work [1, 2], along with the choice of using teacher ensembles for prediction generation. In particular, [2] uses a very similar idea of rotating teacher models during optimization, albeit with an ensemble of the same ResNet-18 model retrained multiple times rather than the different backbones used in this paper
- The marginal contribution of this work over previous work seems minor. In par
- **Clear motivation**: The paper identifies an issue in prior DD works and presents a simple idea (multi-teacher decoupling) to mitigate it.
- **Technical soundness and orthogonality**: The decoupling formulation is well-grounded and orthogonal to existing improvements (e.g., DELT). The derivation is clear.
- **Clarity and completeness**: The manuscript is well-structured, and the supplementary material provides additional training configurations, privacy discussions, and ablation studies.
- **Lack of evidence for diversity claim**: The authors claim that PRISM produces distilled data with higher intra-class diversity and lower cosine similarity, but the evidence is limited to one similarity curve (Fig. 4) and qualitative samples. Besides, there is no comparison to the real dataset’s diversity level, leaving it unclear whether PRISM truly approximates or exceeds the diversity of raw data. The claim of “diverse distilled samples benefit to DD” remains unsubstantiated.
- **Marginal
1. The method and experiments are presented in a structured and easy-to-follow manner.
2. From the experimental results, the proposed method is very effective.
1. The proposed idea is **highly similar to the CV-DD [1] paper** released on arXiv in early 2025, which also employs an ensemble-style multi-model framework combining multiple model predictions and BN distribution alignments. PRISM mainly removes the multi-prediction matching component from CV-DD [1] and retains only a single prediction alignment, making the modification relatively **minor and incremental**. Furthermore, the paper does not cite CV-DD [1], raising concerns about the authors’ awa
The presentation is clear and easy to follow.
The overall writing and experimental presentation are quite rough, and the manuscript is not ready for publication in its current form. The proposed method shows very limited novelty, appearing to be only a marginal modification of G-VBSM without substantial conceptual advancement. Several important results are missing from the main comparison tables. Even if the original paper did not report these results, the referenced baseline methods are open-sourced, and the authors are expected to run t
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
