Continual Contrastive Spoken Language Understanding

Umberto Cappellazzo; Enrico Fini; Muqiao Yang; Daniele Falavigna; Alessio Brutti; Bhiksha Raj

arXiv:2310.02699·eess.AS·June 5, 2024

Open Access 3 Reviews

TL;DR

This paper introduces COCONUT, a continual learning method for spoken language understanding that combines experience replay and contrastive learning to effectively retain knowledge and improve discrimination in sequence-to-sequence models.

Contribution

The paper proposes COCONUT, a novel continual learning approach that integrates contrastive learning with experience replay for improved spoken language understanding.
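As a rough illustration of the experience replay half of this combination, the sketch below keeps a small fixed-size memory of past samples and mixes some of them into each new training batch. The page does not say how rehearsal samples are selected, so reservoir sampling is an assumption here, and `RehearsalBuffer` and `mixed_batch` are hypothetical names rather than anything from the paper.

```python
import random


class RehearsalBuffer:
    """Fixed-capacity memory of past samples (reservoir sampling; an assumed policy)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of samples offered to the buffer so far
        self.rng = random.Random(seed)

    def add(self, sample):
        """Offer one sample; every sample seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def sample(self, k):
        """Draw up to k stored samples without replacement."""
        return self.rng.sample(self.items, min(k, len(self.items)))


def mixed_batch(new_samples, buffer, n_replay):
    """Concatenate current-task samples with replayed samples from earlier tasks."""
    return list(new_samples) + buffer.sample(n_replay)
```

For example, after every sample of an old task has been passed to `add`, calling `mixed_batch(current_batch, buffer, 4)` yields a batch in which up to four entries come from earlier tasks.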

Findings

01

COCONUT outperforms baseline methods on SLU datasets.

02

Contrastive loss enhances representation discrimination.

03

Combining COCONUT with decoder-side methods yields further improvements.

Abstract

Recently, neural networks have shown impressive progress across diverse fields, with speech processing being no exception. However, recent breakthroughs in this area require extensive offline training using large datasets and tremendous computing resources. Unfortunately, these models struggle to retain their previously acquired knowledge when learning new tasks continually, and retraining from scratch is almost always impractical. In this paper, we investigate the problem of learning sequence-to-sequence models for spoken language understanding in a class-incremental learning (CIL) setting and we propose COCONUT, a CIL method that relies on the combination of experience replay and contrastive learning. Through a modified version of the standard supervised contrastive loss applied only to the rehearsal samples, COCONUT preserves the learned representations by pulling closer samples from…
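For readers unfamiliar with the loss the abstract refers to, here is a minimal NumPy sketch of the standard supervised contrastive loss, which pulls same-class embeddings together and pushes other samples apart. It is not COCONUT's modified version, whose details are not given on this page. The function name, the `temperature` default, and the batch layout are assumptions made for illustration.

```python
import numpy as np


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Standard supervised contrastive loss over one batch (not the paper's variant).

    features: (N, D) array of embeddings; labels: (N,) array of class ids.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # L2-normalise so that dot products are cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    # Log-softmax over all other samples; the anchor itself is excluded.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits) * not_self
    log_prob = logits - np.log(exp.sum(axis=1, keepdims=True))
    # Positives are the other samples that share the anchor's label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        return 0.0  # no anchor has a positive, so there is nothing to contrast
    mean_log_prob_pos = (log_prob * positives).sum(axis=1)[has_pos] / n_pos[has_pos]
    return float(-mean_log_prob_pos.mean())
```

The loss is small when samples with the same label already sit close together in embedding space and grows when the labels cut across the clusters. In the setting the abstract describes, a loss of this family would be evaluated on the rehearsal samples only.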

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

Clear presentation of motivations behind the combination of losses, the decisions behind whether to use student vs teacher examples, etc. Nice figures and appropriate complexity to educate without losing the goal of the paper in the weeds.

Weaknesses

For readers who are less familiar with results from other SLU work (both E2E and non-E2E), including results from that work would be useful. If such comparisons are not fair, a note in the table to that effect would help. The text mentions the other work that the table rows correspond to (such as S-KD), but it would be good to see the numbers reported in that work as well, for clearer context, along with results from conventional SLU approaches that are not E2E.

Reviewer 02 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

- A new approach for SLU in a CL setting that is better than a strong experience replay (ER) benchmark.
- Experiments on 2 popular SLU benchmarks that demonstrate the effectiveness of the proposed approach.

Weaknesses

- More details on the continual learning setting used would have been welcome (the reference to Cappellazzo et al., 2023 is not very self-explanatory).
- The experience replay (ER) baseline with a buffer capacity of 2% is still better than COCONUT, and it is unclear how using twice the memory (2% instead of 1%) is a real bottleneck in real applications (the authors could have commented on this more).

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

1. The modification of the loss function is an interesting way to mitigate catastrophic forgetting for seq2seq SLU models.
2. Experiments on two benchmarks and the ablation studies verify the effectiveness of the proposed method over the previous baselines, as well as of the two proposed losses.

Weaknesses

The main weaknesses of this paper are the lack of clarity in the text and the insufficient experiments. Please see the Questions part for details.

Taxonomy

Topics: Speech Recognition and Synthesis · Multimodal Machine Learning Applications · Music and Audio Processing

Methods: Experience Replay · Supervised Contrastive Loss