CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models

Zih-Ching Chen; Yu-Shun Sung; Hung-yi Lee

arXiv:2212.01282·eess.AS·January 23, 2023

Open Access

TL;DR

This paper introduces CHAPTER, a CNN adapter-based tuning method for self-supervised speech models such as HuBERT. It enables efficient adaptation at the feature extractor while tuning fewer parameters and improving performance on downstream tasks.

Contribution

The paper proposes a novel CNN adapter approach for SSL speech models that enables efficient adaptation at the feature extractor, reducing the number of parameters tuned per task and improving downstream task performance.

Findings

01. Fewer than 5% of parameters need tuning per task.

02. Speaker identification accuracy improves from 87.71% to 91.56%.

03. Performance improves on emotion and speaker tasks.

Abstract

Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data. Transformer-based models such as HuBERT, which consist of a feature extractor and transformer layers, are leading the field in the speech domain. SSL models are fine-tuned on a wide range of downstream tasks, which involves re-training the majority of the model for each task. Previous studies have applied adapters, small lightweight modules commonly used in Natural Language Processing (NLP), to adapt pre-trained models to new tasks. However, such efficient tuning techniques only provide adaptation at the transformer layers and fail to perform adaptation at the feature extractor. In this paper, we propose CHAPTER, an efficient tuning method specifically designed for SSL speech models, which applies CNN adapters at the feature extractor. Using this method, we can only…
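
To make the idea of a CNN adapter at the feature extractor concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the parallel residual placement, the zero initialization, and every name in it (`conv1d_same`, `adapter_forward`, `count_params`) are assumptions chosen for illustration, since the abstract does not specify the adapter's architecture. The sketch shows a frozen 1D convolution whose output is adjusted by a small trainable convolution, and a helper that counts the parameters each branch holds.

```python
from typing import List

Signal = List[List[float]]        # shape: [channels][time]
Kernel = List[List[List[float]]]  # shape: [out_channels][in_channels][kernel]


def conv1d_same(x: Signal, weight: Kernel, bias: List[float]) -> Signal:
    """1D convolution, stride 1, zero-padded so output length equals input length.

    Assumes an odd kernel size.
    """
    time_steps = len(x[0])
    kernel = len(weight[0][0])
    pad = kernel // 2
    out: Signal = []
    for oc in range(len(weight)):
        row = []
        for t in range(time_steps):
            acc = bias[oc]
            for ic in range(len(x)):
                for j in range(kernel):
                    idx = t + j - pad
                    if 0 <= idx < time_steps:  # positions outside the signal count as zero
                        acc += weight[oc][ic][j] * x[ic][idx]
            row.append(acc)
        out.append(row)
    return out


def adapter_forward(x: Signal, frozen_w: Kernel, frozen_b: List[float],
                    adapter_w: Kernel, adapter_b: List[float]) -> Signal:
    """Frozen conv layer plus a trainable conv adapter on a parallel branch.

    Hypothetical placement: the adapter sees the same input as the frozen
    layer and its output is added to the frozen layer's output.
    """
    base = conv1d_same(x, frozen_w, frozen_b)       # pre-trained, not updated
    delta = conv1d_same(x, adapter_w, adapter_b)    # the only part that is tuned
    return [[b + d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


def count_params(weight: Kernel, bias: List[float]) -> int:
    """Number of scalar parameters in one conv layer."""
    return len(weight) * len(weight[0]) * len(weight[0][0]) + len(bias)


# Toy example: one channel, three time steps, kernel size 3.
x = [[1.0, 2.0, 3.0]]
frozen_w, frozen_b = [[[0.0, 1.0, 0.0]]], [0.0]     # identity kernel
adapter_w, adapter_b = [[[0.0, 0.0, 0.0]]], [0.0]   # zero-initialized adapter

# With a zero-initialized adapter the frozen layer's output is unchanged.
print(adapter_forward(x, frozen_w, frozen_b, adapter_w, adapter_b))  # [[1.0, 2.0, 3.0]]
print(count_params(adapter_w, adapter_b))                            # 4
```

Zero initialization is a common choice for residual adapters because tuning then starts from the pre-trained model's behavior. In this toy case the adapter is as large as the layer it adapts; a practical adapter would be far smaller than the frozen backbone, which is what keeps the tuned fraction low.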

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Absolute Position Encodings · Layer Normalization · Softmax · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Linear Layer · Dense Connections · Label Smoothing