SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition

Hongfei Xue; Qijie Shao; Kaixun Huang; Peikun Chen; Jie Liu; Lei Xie

arXiv:2309.16937 · cs.CL · April 30, 2024

TL;DR

This paper introduces SSHR, a method for fine-tuning the MMS model that uses the hierarchical layer representations of self-supervised models to improve multilingual automatic speech recognition, and reports state-of-the-art results.

Contribution

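The study proposes a method that exploits the representations of different layers in self-supervised learning (SSL) models for multilingual ASR: it extracts language-related information from the middle layers and uses it to guide language-specific extraction through self-attention.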

Findings

01. Middle layers contain language-related info

02. High layers encode content-related info

03. Method achieves state-of-the-art performance on benchmarks

Abstract

Multilingual automatic speech recognition (ASR) systems have garnered attention for their potential to extend language coverage globally. While self-supervised learning (SSL) models, like MMS, have demonstrated their effectiveness in multilingual ASR, it is worth noting that various layers' representations potentially contain distinct information that has not been fully leveraged. In this study, we propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune the MMS model. We first analyze the different layers of MMS and show that the middle layers capture language-related information, and the high layers encode content-related information, which gradually decreases in the final layers. Then, we extract a language-related frame from correlated middle layers and guide specific language extraction through self-attention mechanisms. Additionally,…
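
The idea of drawing a language-related frame from the middle layers can be illustrated with a small sketch. It is not the authors' implementation: the softmax layer weighting, the mean pooling over time, the layer indices, and every name below (`language_frame`, `layer_scores`, `toy_states`) are assumptions chosen for illustration, since the abstract does not specify these details.

```python
import math
from typing import List, Sequence

Vector = List[float]
LayerFrames = List[Vector]  # one feature vector per time frame


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def language_frame(
    hidden_states: List[LayerFrames],
    middle_layers: Sequence[int],
    layer_scores: Sequence[float],
) -> Vector:
    """Fuse the selected middle layers and mean-pool over time.

    hidden_states[l][t] is the feature vector of layer l at frame t.
    layer_scores holds one hypothetical (normally learnable) score per
    selected layer; softmax turns the scores into mixing weights.
    """
    selected = [hidden_states[l] for l in middle_layers]
    if len(selected) != len(layer_scores):
        raise ValueError("need exactly one score per selected layer")
    weights = softmax(layer_scores)
    num_frames = len(selected[0])
    dim = len(selected[0][0])
    pooled = [0.0] * dim
    for weight, layer in zip(weights, selected):
        for frame in layer:
            for d, value in enumerate(frame):
                # Weighted layer sum and temporal mean in a single pass.
                pooled[d] += weight * value / num_frames
    return pooled


# Toy input: 4 layers, 2 frames, 2 dimensions; layer l is filled with float(l).
toy_states = [[[float(l)] * 2 for _ in range(2)] for l in range(4)]

# Treat layers 1 and 2 as the "middle" layers, with equal scores.
frame = language_frame(toy_states, middle_layers=[1, 2], layer_scores=[0.0, 0.0])
print(frame)
```

In a real system the per-layer features would come from the SSL encoder and the scores would be trained jointly with the ASR objective; the sketch only shows the shape of the computation, a single vector summarizing the chosen layers.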

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing