Context-aware Fine-tuning of Self-supervised Speech Models

Suwon Shon; Felix Wu; Kwangyoun Kim; Prashant Sridhar; Karen Livescu; Shinji Watanabe

arXiv:2212.08542·eess.AS·March 30, 2023

Open Access

TL;DR

This paper introduces a context-aware fine-tuning method for self-supervised speech models that improves performance on various speech tasks by encoding surrounding segments during training without increasing inference complexity.

Contribution

The paper proposes a novel context-aware fine-tuning approach that encodes surrounding speech segments into a context vector, enhancing downstream task performance without added inference overhead.

Findings

01 Outperforms standard fine-tuning on the SLUE and Libri-light benchmarks.

02 Rivals a strong context-injection baseline that uses neighboring speech segments during inference.

03 Improves performance on ASR, NER, and sentiment analysis tasks.

Abstract

Self-supervised pre-trained transformers have improved the state of the art on a variety of speech tasks. Due to the quadratic time and space complexity of self-attention, they usually operate at the level of relatively short (e.g., utterance) segments. In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tuning. We attach a context module on top of the last layer of a pre-trained model to encode the whole segment into a context embedding vector which is then used as an additional feature for the final prediction. During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments. This allows the model to make predictions without access to these surrounding segments at inference time and requires only…
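The mechanism described in the abstract can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the context module is assumed to be mean pooling followed by a linear projection, the auxiliary loss is assumed to be a mean squared L2 distance to the neighbors' context vectors, and the names `context_embedding`, `auxiliary_loss`, `append_context`, and the weight `alpha` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Average frame-level features over time into one segment-level vector."""
    n_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]


def context_embedding(frames: Sequence[Sequence[float]],
                      weight: Sequence[Sequence[float]]) -> Vector:
    """Hypothetical context module: mean-pool the last-layer features,
    then apply a linear projection (one row of `weight` per output dim)."""
    pooled = mean_pool(frames)
    return [sum(w * x for w, x in zip(row, pooled)) for row in weight]


def auxiliary_loss(current: Vector, neighbors: Sequence[Vector]) -> float:
    """Assumed auxiliary loss: mean squared L2 distance between the current
    segment's context embedding and those of its surrounding segments."""
    total = 0.0
    for neighbor in neighbors:
        total += sum((c - n) ** 2 for c, n in zip(current, neighbor))
    return total / len(neighbors)


def total_loss(task_loss: float, aux_loss: float, alpha: float = 0.1) -> float:
    """Fine-tuning objective: task loss plus a weighted auxiliary term.
    `alpha` is an assumed hyperparameter, not a value from the paper."""
    return task_loss + alpha * aux_loss


def append_context(frames: Sequence[Sequence[float]],
                   context: Vector) -> List[Vector]:
    """Use the context embedding as an additional feature by concatenating
    it to every frame before the final prediction layer."""
    return [list(frame) + list(context) for frame in frames]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    current_frames = [[1.0, 3.0], [3.0, 5.0]]
    ctx = context_embedding(current_frames, identity)
    # Neighbor context vectors are only needed while fine-tuning.
    aux = auxiliary_loss(ctx, [[2.0, 4.0], [3.0, 4.0]])
    print(total_loss(task_loss=1.0, aux_loss=aux))
    # At inference, only the current segment is encoded.
    print(append_context(current_frames, ctx))
```

In this sketch the neighboring segments appear only in `auxiliary_loss`, so they are needed during fine-tuning but not at inference, where the model uses just the current segment's own context embedding.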

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques