Mask-based Neural Beamforming for Moving Speakers with Self-Attention-based Tracking

Tsubasa Ochiai; Marc Delcroix; Tomohiro Nakatani; Shoko Araki

arXiv:2205.03568·eess.AS·May 10, 2022·1 cites

Open Access

TL;DR

This paper introduces a neural network with self-attention layers to compute optimal attention weights for mask-based beamforming, significantly improving performance for moving speakers while maintaining effectiveness for stationary sources.

Contribution

The paper proposes a learning-based framework that uses self-attention to adaptively compute the attention weights for beamforming with moving sources, outperforming classical heuristic methods.

Findings

01 Enhanced beamforming performance with moving sources.

02 Maintained high performance for stationary sources.

03 Demonstrated effectiveness of self-attention in source tracking.

Abstract

Beamforming is a powerful tool designed to enhance speech signals from the direction of a target source. Computing the beamforming filter requires estimating spatial covariance matrices (SCMs) of the source and noise signals. Time-frequency masks are often used to compute these SCMs. Most studies of mask-based beamforming have assumed that the sources do not move. However, sources often move in practice, which causes performance degradation. In this paper, we address the problem of mask-based beamforming for moving sources. We first review classical approaches to tracking a moving source, which perform online or blockwise computation of the SCMs. We show that these approaches can be interpreted as computing a sum of instantaneous SCMs weighted by attention weights. These weights indicate which time frames of the signal to consider in the SCM computation. Online or blockwise computation…
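The interpretation described in the abstract, where an SCM is a sum of instantaneous SCMs weighted by attention weights, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the array shapes, the uniform blockwise weights, the exponential-forgetting form of the online weights, and the row normalization are all assumptions made for illustration. In the paper's framework, a neural network with self-attention layers would predict the weight matrix `W` instead of the fixed rules shown here.

```python
import numpy as np

def instantaneous_scms(Y, mask):
    """Y: (T, F, C) complex STFT; mask: (T, F) real values in [0, 1].
    Returns mask-weighted outer products y y^H with shape (T, F, C, C)."""
    outer = np.einsum("tfc,tfd->tfcd", Y, Y.conj())
    return mask[..., None, None] * outer

def attention_weighted_scms(Psi, W):
    """Psi: (T, F, C, C) instantaneous SCMs; W: (T, T) attention weights,
    where W[t, s] is the weight of frame s in the SCM for frame t."""
    return np.einsum("ts,sfcd->tfcd", W, Psi)

def blockwise_weights(T, block):
    """Hypothetical helper: uniform weights over frames in the same block."""
    idx = np.arange(T) // block
    W = (idx[:, None] == idx[None, :]).astype(float)
    return W / W.sum(axis=1, keepdims=True)

def online_weights(T, alpha):
    """Hypothetical helper: causal weights decaying as alpha**(t - s), s <= t."""
    t = np.arange(T)
    lag = t[:, None] - t[None, :]
    W = np.where(lag >= 0, alpha ** np.maximum(lag, 0), 0.0)
    return W / W.sum(axis=1, keepdims=True)

# Toy data: 8 frames, 3 frequency bins, 2 microphones (random, illustration only).
rng = np.random.default_rng(0)
T, F, C = 8, 3, 2
Y = rng.standard_normal((T, F, C)) + 1j * rng.standard_normal((T, F, C))
mask = rng.uniform(size=(T, F))

Psi = instantaneous_scms(Y, mask)
W_block = blockwise_weights(T, block=4)
W_online = online_weights(T, alpha=0.9)
Phi_block = attention_weighted_scms(Psi, W_block)
Phi_online = attention_weighted_scms(Psi, W_online)
```

With these assumed weights, the blockwise SCM is constant within each block and the online SCM at frame `t` depends only on frames up to `t`. Both are produced by the same weighted-sum function; only the weight matrix differs.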

Taxonomy

Topics: Speech and Audio Processing · Underwater Acoustics Research · Indoor and Outdoor Localization Technologies