Continual Learning with Transformers for Image Classification
Beyza Ermis, Giovanni Zappella, Martin Wistuba, Aditya Rawal, Cedric Archambeau

TL;DR
This paper demonstrates that Adaptive Distillation of Adapters (ADA) enables continual learning with pre-trained Transformers in image classification, maintaining performance without retraining or increasing parameters, and offering faster inference.
Contribution
The paper validates ADA for continual learning in computer vision, showing it preserves performance efficiently without model retraining or parameter growth.
Findings
ADA maintains high accuracy across tasks.
ADA is faster at inference than state-of-the-art methods.
ADA does not require retraining or parameter expansion.
Abstract
In many real-world scenarios, data to train machine learning models become available over time. However, neural network models struggle to continually learn new concepts without forgetting what has been learnt in the past. This phenomenon is known as catastrophic forgetting and it is often difficult to prevent due to practical constraints, such as the amount of data that can be stored or the limited computational resources that can be used. Moreover, training large neural networks, such as Transformers, from scratch is very costly and requires a vast amount of training data, which might not be available in the application domain of interest. A recent trend indicates that dynamic architectures based on an expansion of the parameters can reduce catastrophic forgetting efficiently in continual learning, but this needs complex tuning to balance the growing number of parameters and barely share…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
