Self-Supervised Acoustic Word Embedding Learning via Correspondence Transformer Encoder

Jingru Lin; Xianghu Yue; Junyi Ao; Haizhou Li

arXiv:2307.09871 · eess.AS · July 20, 2023 · Interspeech

Open Access

TL;DR

This paper introduces a self-supervised learning method using a transformer encoder to produce robust acoustic word embeddings that are invariant to speech variations, achieving state-of-the-art results in low-resource cross-lingual settings.

Contribution

It proposes the Correspondence Transformer Encoder, a novel self-supervised framework that learns invariant AWEs from unlabelled speech using a teacher-student approach.

Findings

01. Embeddings are robust to speaker and domain variations.

02. Achieves state-of-the-art performance on low-resource cross-lingual tasks.

03. Effective in learning from unlabelled speech data.

Abstract

Acoustic word embeddings (AWEs) aim to map a variable-length speech segment into a fixed-dimensional representation. High-quality AWEs should be invariant to variations such as duration, pitch and speaker. In this paper, we introduce a novel self-supervised method to learn robust AWEs from a large-scale unlabelled speech corpus. Our model, named Correspondence Transformer Encoder (CTE), employs a teacher-student learning framework. We train the model based on the idea that different realisations of the same word should be close in the underlying embedding space. Specifically, we feed the teacher and student encoders with different acoustic instances of the same word and pre-train the model with a word-level loss. Our experiments show that the embeddings extracted from the proposed CTE model are robust to speech variations, e.g. speakers and domains. Additionally, when evaluated on…
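The core idea in the abstract (a variable-length segment becomes a fixed-size vector, and two instances of the same word should land close together) can be illustrated with a small sketch. This is not the CTE model: the linear projection with mean pooling stands in for the transformer encoder, cosine distance stands in for the word-level loss (the abstract does not specify its form), and all names (`embed`, `cosine_distance`, `correspondence_loss`) and the random toy inputs are hypothetical.

```python
import math
import random


def embed(frames, weights):
    """Map a variable-length list of feature frames to one fixed-size vector.

    Assumption: a linear projection plus mean pooling over time replaces the
    transformer encoder described in the paper. The output length equals
    len(weights) regardless of how many frames are passed in.
    """
    pooled = [0.0] * len(weights)
    for frame in frames:
        for i, row in enumerate(weights):
            pooled[i] += sum(w * x for w, x in zip(row, frame))
    return [value / len(frames) for value in pooled]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def correspondence_loss(student_frames, teacher_frames, student_w, teacher_w):
    """Distance between the student's embedding of one instance and the
    teacher's embedding of a different instance of the same word.

    Assumption: cosine distance is an illustrative stand-in for the paper's
    word-level loss.
    """
    student_emb = embed(student_frames, student_w)
    teacher_emb = embed(teacher_frames, teacher_w)
    return cosine_distance(student_emb, teacher_emb)


# Toy data: two "instances of the same word" with different durations.
rng = random.Random(0)
in_dim, out_dim = 4, 3
weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
instance_a = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(7)]
instance_b = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(12)]

emb_a = embed(instance_a, weights)
emb_b = embed(instance_b, weights)
loss = correspondence_loss(instance_a, instance_b, weights, weights)
print(len(emb_a), len(emb_b), loss)
```

In this sketch the student and teacher share the same weights for simplicity; how the teacher is actually parameterised and updated is not stated in the abstract. Minimising such a distance over many same-word pairs is what would push different realisations of a word towards each other in the embedding space.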

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Softmax · Dense Connections · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection