Speech representation learning: Learning bidirectional encoders with single-view, multi-view, and multi-task methods
Qingming Tang

TL;DR
This thesis explores speech representation learning methods across multiple settings, emphasizing the use of unlabeled and weakly labeled data to improve sequence prediction tasks, in work that predates the Transformer era.
Contribution
The thesis provides a comprehensive study of speech representation learning methods across supervised, unsupervised, semi-supervised, and multi-view settings, highlighting approaches beyond Transformer-based models.
Findings
Effective multi-view learning strategies for speech data
Unsupervised and semi-supervised methods improve downstream tasks
Insights into speech representation learning before large-scale pre-training
Abstract
This thesis focuses on representation learning for sequence data over time or space, aiming to improve downstream sequence prediction tasks by using the learned representations. Supervised learning has been the dominant approach for training deep neural networks to learn good sequential representations. However, one factor that limits scaling supervised learning is the lack of sufficient annotated data. Motivated by this challenge, it is natural to explore representation learning methods that can utilize large amounts of unlabeled and weakly labeled data, as well as an additional data modality. I describe my broad study of representation learning for speech data. Unlike most other works that focus on a single learning setting, this thesis studies multiple settings: supervised learning with auxiliary losses, unsupervised learning, semi-supervised learning, and multi-view learning.…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Layer Normalization · Softmax · Linear Layer · Adam · Dense Connections · Label Smoothing · Dropout
