TL;DR
This paper introduces Continual Distillation (CD), a paradigm in which a student learns sequentially from a stream of teacher models covering different domains, without retaining access to earlier teachers. It addresses the resulting challenges of knowledge transfer and forgetting.
Contribution
It proposes SE2D (Self External Data Distillation), a method that stabilizes learning from heterogeneous teachers by preserving logits on external data, improving cross-domain generalization and reducing knowledge forgetting.
Findings
SE2D effectively reduces Unseen Knowledge Forgetting.
External unlabeled data enables transfer from unseen domains.
SE2D improves performance across multiple benchmarks.
Abstract
Deep learning models continue to scale, with some requiring more storage than many large-scale datasets. Thus, we introduce a new paradigm: Continual Distillation (CD), where a student learns sequentially from a stream of teacher models without retaining access to earlier teachers. CD faces two challenges: teacher training data is unavailable, and teachers have varying expertise. We show that external unlabeled data enables Unseen Knowledge Transfer (UKT), allowing the student to acquire information from domains that are not present in the training data but are known to the teacher. We also show that sequential distillation causes Unseen Knowledge Forgetting (UKF), in which transferred knowledge is lost after training on later teachers. To better trade off between UKT and UKF, we propose Self External Data Distillation (SE2D), a method that preserves logits on external data to stabilize learning…
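The abstract describes SE2D only at a high level: distill from the current teacher on external unlabeled data while preserving logits on that data. The Python sketch below illustrates one plausible per-sample objective of that shape. It is not the authors' implementation. The choice of KL divergence for distillation, mean squared error for logit preservation, the assumption that the preserved logits are the student's own outputs recorded before the current teacher arrived, and the names `temperature` and `preserve_weight` are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on one external unlabeled sample (assumed form)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def preservation_loss(stored_logits, student_logits):
    """Mean squared error between stored and current student logits.

    `stored_logits` is assumed to be what the student produced on this
    external sample before training on the current teacher began.
    """
    diffs = [(a - b) ** 2 for a, b in zip(stored_logits, student_logits)]
    return sum(diffs) / len(diffs)


def combined_loss(teacher_logits, student_logits, stored_logits,
                  preserve_weight=1.0, temperature=2.0):
    """Hypothetical objective: learn from the teacher, stay near old logits."""
    return (distill_loss(teacher_logits, student_logits, temperature)
            + preserve_weight * preservation_loss(stored_logits, student_logits))


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]   # current teacher's logits on an external sample
    stored = [1.0, 1.2, -0.5]    # student's logits recorded earlier (assumed)
    student = [1.5, 0.8, -0.7]   # student's current logits
    print(combined_loss(teacher, student, stored, preserve_weight=0.5))
```

In this sketch, raising `preserve_weight` pulls the student toward its earlier outputs, which would be expected to reduce forgetting at the cost of slower uptake from the new teacher; lowering it does the opposite. That is the UKT versus UKF trade-off the abstract refers to, expressed under the assumptions stated above.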
