Continual Contrastive Spoken Language Understanding
Umberto Cappellazzo, Enrico Fini, Muqiao Yang, Daniele Falavigna, Alessio Brutti, Bhiksha Raj

TL;DR
This paper introduces COCONUT, a continual learning method for spoken language understanding that combines experience replay and contrastive learning to effectively retain knowledge and improve discrimination in sequence-to-sequence models.
Contribution
The paper proposes COCONUT, a novel continual learning approach that integrates contrastive learning with experience replay for improved spoken language understanding.
Findings
COCONUT outperforms baseline methods on SLU datasets.
Contrastive loss enhances representation discrimination.
Combining COCONUT with decoder-side methods yields further improvements.
Abstract
Recently, neural networks have shown impressive progress across diverse fields, with speech processing being no exception. However, recent breakthroughs in this area require extensive offline training using large datasets and tremendous computing resources. Unfortunately, these models struggle to retain their previously acquired knowledge when learning new tasks continually, and retraining from scratch is almost always impractical. In this paper, we investigate the problem of learning sequence-to-sequence models for spoken language understanding in a class-incremental learning (CIL) setting and we propose COCONUT, a CIL method that relies on the combination of experience replay and contrastive learning. Through a modified version of the standard supervised contrastive loss applied only to the rehearsal samples, COCONUT preserves the learned representations by pulling closer samples from…
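The idea of applying a supervised contrastive loss only to rehearsal samples can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses the standard supervised contrastive formulation and assumes that "applied only to the rehearsal samples" means only rehearsal samples serve as anchors. The function name `supcon_on_rehearsal`, the `is_rehearsal` mask, and the temperature value are hypothetical choices made for this example.

```python
import math


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_on_rehearsal(embeddings, labels, is_rehearsal, temperature=0.1):
    """Supervised contrastive loss averaged over rehearsal anchors only.

    Assumption for this sketch: samples from the current task still appear
    as positives/negatives, but never as anchors.
    """
    z = [normalize(v) for v in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        if not is_rehearsal[i]:
            continue  # only rehearsal samples act as anchors
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Log-sum-exp over all other samples, computed stably.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of the positives.
        losses.append(-sum(sims[p] - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: the first two samples come from the rehearsal buffer.
labels = [0, 0, 1, 1]
is_rehearsal = [True, True, False, False]
clustered = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
scattered = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
loss_clustered = supcon_on_rehearsal(clustered, labels, is_rehearsal)
loss_scattered = supcon_on_rehearsal(scattered, labels, is_rehearsal)
```

In the toy batch, the loss is lower when same-class embeddings coincide than when they are orthogonal, which is the "pulling closer" behavior the abstract describes. In an experience-replay setup, a term like this would be added to the usual training loss computed on the mix of new and buffered samples.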
Peer Reviews
Decision · Submitted to ICLR 2024
Clear presentation of motivations behind the combination of losses, the decisions behind whether to use student vs teacher examples, etc. Nice figures and appropriate complexity to educate without losing the goal of the paper in the weeds.
For readers who may not be as familiar with the results of other SLU work (both E2E and non-E2E), including results from that work could be useful. If such comparisons are not fair, a note in the table to that effect would help. The text mentions the other work that describes those rows (like S-KD), but it would be nice to see the numbers reported in that work as well, for clearer context, along with results of conventional SLU approaches that are not E2E.
- A new approach for SLU in a CL setting that is better than a strong experience replay (ER) benchmark.
- Experiments on 2 popular SLU benchmarks that demonstrate the effectiveness of the proposed approach.
- More details on the continual learning setting used would have been welcome (the reference to Cappellazzo et al., 2023 is not very self-explanatory).
- The experience replay (ER) baseline with a buffer capacity of 2% is still better than COCONUT, and it is unclear how using twice the memory (2% instead of 1%) is a real bottleneck in real applications (the authors could have commented on this more).
1. The modification of the loss function is an interesting way to mitigate catastrophic forgetting for seq2seq SLU models.
2. Experiments on two benchmarks and the ablation studies verify the effectiveness of the proposed method over the previous baselines, as well as of the two proposed losses.
The main weaknesses of this paper are the lack of clarity in the text and the insufficient experiments. Please see the Questions part for details.
Taxonomy
Topics: Speech Recognition and Synthesis · Multimodal Machine Learning Applications · Music and Audio Processing
Methods: Experience Replay · Supervised Contrastive Loss
