Speech representation learning: Learning bidirectional encoders with single-view, multi-view, and multi-task methods

Qingming Tang

arXiv:2308.00129 · eess.AS · August 2, 2023

Open Access

TL;DR

This thesis explores speech representation learning across multiple settings, emphasizing the use of unlabeled and weakly labeled data to improve downstream sequence prediction, with a focus on methods that predate the Transformer era.

Contribution

It provides a comprehensive study of speech representation learning across supervised, unsupervised, semi-supervised, and multi-view settings, highlighting approaches beyond Transformer-based models.

Findings

01. Effective multi-view learning strategies for speech data

02. Unsupervised and semi-supervised methods improve downstream tasks

03. Insights into speech representation learning before large-scale pre-training

Abstract

This thesis focuses on representation learning for sequence data over time or space, aiming to improve downstream sequence prediction tasks by using the learned representations. Supervised learning has been the dominant approach for training deep neural networks to learn good sequential representations. However, one factor limiting the scaling of supervised learning is the shortage of annotated data. Motivated by this challenge, it is natural to explore representation learning methods that can utilize large amounts of unlabeled and weakly labeled data, as well as an additional data modality. I describe my broad study of representation learning for speech data. Unlike most other works that focus on a single learning setting, this thesis studies multiple settings: supervised learning with auxiliary losses, unsupervised learning, semi-supervised learning, and multi-view learning.…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Layer Normalization · Softmax · Linear Layer · Adam · Dense Connections · Label Smoothing · Dropout