Spatial HuBERT: Self-supervised Spatial Speech Representation Learning for a Single Talker from Multi-channel Audio

Antoni Dimitriadis; Siqi Pan; Vidhyasaharan Sethu; Beena Ahmed

arXiv:2310.10922 · cs.CL · October 18, 2023 · 1 citation

TL;DR

Spatial HuBERT is a self-supervised model that learns both acoustic and spatial features from multi-channel audio. Its representations improve performance on spatial speech tasks, particularly in noisy and reverberant environments, and are also useful for speech localisation.

Contribution

The paper introduces Spatial HuBERT, presented as the first self-supervised model to learn spatial speech representations for a single speaker from multi-channel audio. Its representations outperform single-channel speech representations on spatial downstream tasks.

Findings

01 Outperforms state-of-the-art single-channel speech representations in spatial tasks

02 Effective in reverberant and noisy environments

03 Demonstrates utility in speech localisation

Abstract

Self-supervised learning has been used to leverage unlabelled data, improving accuracy and generalisation of speech systems through the training of representation models. While many recent works have sought to produce effective representations across a variety of acoustic domains, languages, modalities and even simultaneous speakers, these studies have all been limited to single-channel audio recordings. This paper presents Spatial HuBERT, a self-supervised speech representation model that learns both acoustic and spatial information pertaining to a single speaker in a potentially noisy environment by using multi-channel audio inputs. Spatial HuBERT learns representations that outperform state-of-the-art single-channel speech representations on a variety of spatial downstream tasks, particularly in reverberant and noisy environments. We also demonstrate the utility of the…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing