M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses

Yufeng Yang; Desh Raj; Ju Lin; Niko Moritz; Junteng Jia; Gil Keren; Egor Lakomkin; Yiteng Huang; Jacob Donley; Jay Mahadeokar; Ozlem Kalinli

arXiv:2409.11494·eess.AS·September 19, 2024

TL;DR

This paper introduces M-BEST-RQ, a multi-channel speech foundation model for smart glasses that leverages large-scale self-supervised learning to outperform supervised models on various real-world tasks with minimal labeled data.

Contribution

It presents the first multi-channel speech foundation model for smart glasses using SSL, evaluated on real downstream tasks, outperforming supervised models with less labeled data.

Findings

1. M-BEST-RQ matches or surpasses supervised models across tasks.

2. The model outperforms a supervised ASR baseline with only 8 hours of labeled data.

3. Evaluation on real datasets demonstrates practical effectiveness.

Abstract

The growing popularity of multi-channel wearable devices, such as smart glasses, has led to a surge of applications such as targeted speech recognition and enhanced hearing. However, current approaches to solve these tasks use independently trained models, which may not benefit from large amounts of unlabeled data. In this paper, we propose M-BEST-RQ, the first multi-channel speech foundation model for smart glasses, which is designed to leverage large-scale self-supervised learning (SSL) in an array-geometry agnostic approach. While prior work on multi-channel speech SSL only evaluated on simulated settings, we curate a suite of real downstream tasks to evaluate our model, namely (i) conversational automatic speech recognition (ASR), (ii) spherical active source localization, and (iii) glasses wearer voice activity detection, which are sourced from the MMCSG and EasyCom datasets. We…

Taxonomy

Topics: Interactive and Immersive Displays · Tactile and Sensory Interactions