mHuBERT-147: A Compact Multilingual HuBERT Model

Marcely Zanon Boito; Vivek Iyer; Nikolaos Lagos; Laurent Besacier; Ioan Calapodescu

arXiv:2406.06371·cs.CL·November 22, 2024·2 cites

TL;DR

mHuBERT-147 is a compact (95M-parameter) multilingual speech representation model trained on 90K hours of open-license data. It outperforms larger models on ML-SUPERB, relying on faiss-based clustering and a multilingual batching up-sampling strategy.

Contribution

The paper introduces mHuBERT-147, a multilingual HuBERT model with 95M parameters. Faiss-based clustering makes label assignment 5.2x faster than the original method, and the resulting model outperforms larger models trained on substantially more data.

Findings

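01. Outperforms larger models on the ML-SUPERB benchmark, ranking second on the 10min leaderboard and first on the 1h leaderboard.

02. Achieves state-of-the-art scores for 3 tasks.

03. Consistently surpasses XLS-R (300M params; 436K hours) across ASR/LID tasks and is competitive with the much larger MMS (1B params; 491K hours).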

Abstract

We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual…
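
The abstract mentions a multilingual batching up-sampling strategy that uses both language and dataset diversity, but does not give its formula here. As a rough illustration only, the sketch below uses exponent-based (temperature) up-sampling, a common technique in multilingual pre-training, applied first across languages and then across datasets within each language. The function names, the two-level structure, the exponent values, and the hour counts are all assumptions made for this example, not the paper's actual procedure or data.

```python
def upsampling_probs(hours, alpha=0.5):
    """Return sampling probabilities proportional to (share of hours) ** alpha.

    alpha = 1.0 keeps the natural distribution; smaller values flatten it,
    which up-samples low-resource entries. (Hypothetical helper.)
    """
    total = sum(hours.values())
    weights = {name: (h / total) ** alpha for name, h in hours.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}


def two_level_probs(corpus, alpha_lang=0.5, alpha_data=0.5):
    """Combine language-level and dataset-level up-sampling.

    corpus maps language -> {dataset name -> hours}. The probability of
    drawing from (language, dataset) is the product of the language
    probability and the dataset probability within that language.
    This two-level form is an assumption for illustration.
    """
    lang_hours = {lang: sum(ds.values()) for lang, ds in corpus.items()}
    p_lang = upsampling_probs(lang_hours, alpha_lang)
    probs = {}
    for lang, datasets in corpus.items():
        p_data = upsampling_probs(datasets, alpha_data)
        for name, p in p_data.items():
            probs[(lang, name)] = p_lang[lang] * p
    return probs


if __name__ == "__main__":
    # Made-up hour counts, purely illustrative.
    toy_corpus = {
        "en": {"corpus_a": 800.0, "corpus_b": 100.0},
        "sw": {"corpus_c": 100.0},
    }
    for key, p in sorted(two_level_probs(toy_corpus).items()):
        print(key, round(p, 3))
```

With an exponent below 1, a language holding 10% of the hours is sampled more often than 10% of the time, which is the general effect any up-sampling scheme of this kind aims for.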

Code & Models

Repositories

https://huggingface.co/utter-project/mHuBERT-147 (official)

Taxonomy

Topics: Natural Language Processing Techniques