TL;DR
This paper identifies imbalanced training as a cause of catastrophic forgetting in neural machine translation, even in static training, and proposes a method called COKD to mitigate this issue, leading to improved translation performance.
Contribution
It introduces the concept of imbalanced training as a cause of catastrophic forgetting in static neural network training and proposes COKD, a novel knowledge distillation approach to address it.
Findings
COKD effectively alleviates imbalanced training.
Experimental results show substantial improvements in translation quality.
The method outperforms strong baseline systems across multiple tasks.
Abstract
Neural networks tend to gradually forget the previously learned knowledge when learning multiple tasks sequentially from dynamic data distributions. This problem is called catastrophic forgetting, which is a fundamental challenge in the continual learning of neural networks. In this work, we observe that catastrophic forgetting not only occurs in continual learning but also affects the traditional static training. Neural networks, especially neural machine translation models, suffer from catastrophic forgetting even if they learn from a static training set. To be specific, the final model pays imbalanced attention to training samples, where recently exposed samples attract more attention than earlier samples. The underlying cause is that training samples do not get balanced training in each model update, so we name this problem imbalanced training. To alleviate this…
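The summary names knowledge distillation as the basis of COKD, but the abstract is cut off before the method is described. As general background only, the following is a minimal sketch of a standard token-level distillation loss in which a student distribution is pulled toward a teacher distribution. It is not the COKD objective. The function names and the `alpha` and `temperature` parameters are illustrative assumptions and do not come from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, target_index,
                      alpha=0.5, temperature=1.0):
    """Generic distillation loss for one token (assumed form, not COKD).

    alpha weights the teacher-matching term; (1 - alpha) weights the
    ordinary cross-entropy against the reference token.
    """
    student_probs = softmax(student_logits, temperature)
    teacher_probs = softmax(teacher_logits, temperature)
    kd_term = kl_divergence(teacher_probs, student_probs)
    ce_term = -math.log(softmax(student_logits)[target_index])
    return alpha * kd_term + (1 - alpha) * ce_term


# Usage: one token position over a three-word vocabulary.
loss = distillation_loss([2.0, 0.5, 0.1], [1.5, 1.0, 0.2], target_index=0)
print(loss)
```

With `alpha=1.0` the loss reduces to the teacher-matching term alone, which is zero when student and teacher logits coincide; with `alpha=0.0` it is plain cross-entropy.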
Taxonomy
Methods: Knowledge Distillation
