Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction

Jiatong Shi; Hirofumi Inaguma; Xutai Ma; Ilia Kulikov; Anna Sun

arXiv:2310.02720·cs.SD·January 31, 2024·1 cites

TL;DR

This paper introduces a multi-resolution speech SSL model using a hierarchical Transformer and masked unit prediction, improving efficiency and performance across speech recognition benchmarks.

Contribution

It presents a novel multi-resolution hierarchical Transformer architecture for speech SSL, enhancing performance and efficiency over existing fixed-resolution models.

Findings

01. Outperforms HuBERT on LibriSpeech speech recognition tasks.

02. Achieves superior results on SUPERB and ML-SUPERB benchmarks.

03. Offers more efficient inference compared to prior models.

Abstract

Existing Self-Supervised Learning (SSL) models for speech typically process speech signals at a fixed resolution of 20 milliseconds. This approach overlooks the varying informational content present at different resolutions in speech signals. In contrast, this paper aims to incorporate multi-resolution information into speech self-supervised representation learning. We introduce an SSL model that leverages a hierarchical Transformer architecture, complemented by HuBERT-style masked prediction objectives, to process speech at multiple resolutions. Experimental results indicate that the proposed model not only achieves more efficient inference but also exhibits superior or comparable performance to the original HuBERT model over various tasks. Specifically, significant performance improvements over the original HuBERT have been observed in fine-tuning experiments on the LibriSpeech speech…
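
To make the resolution argument concrete, the following minimal sketch counts how many frames a Transformer would see for the same utterance at two frame resolutions, and how the number of pairwise self-attention interactions changes. The 20 ms value comes from the abstract; the 40 ms value, the 10-second utterance, and the helper names `num_frames` and `attention_pairs` are illustrative assumptions, not details taken from the paper.

```python
def num_frames(duration_ms: int, frame_ms: int) -> int:
    """Number of whole frames covering an utterance at a given frame resolution."""
    return duration_ms // frame_ms


def attention_pairs(n_frames: int) -> int:
    """Pairwise interactions in full self-attention, which grow quadratically with length."""
    return n_frames * n_frames


# Hypothetical 10-second utterance.
duration_ms = 10_000

# 20 ms is the fixed resolution named in the abstract; 40 ms is an assumed coarser resolution.
frames_20 = num_frames(duration_ms, 20)
frames_40 = num_frames(duration_ms, 40)

print(frames_20, attention_pairs(frames_20))
print(frames_40, attention_pairs(frames_40))
```

Under these assumptions, halving the frame rate halves the sequence length and cuts the pairwise attention interactions by a factor of four, which is one plausible reason that processing part of the network at a coarser resolution could make inference cheaper.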

Videos

Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction (SlidesLive)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing