IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments

Dinanath Padhya, Sajen Maharjan, Binita Adhikari, and Ishwor Raj Pokharel

arXiv:2605.14736 · cs.SD · May 18, 2026

TL;DR

IsoNet is an audio-visual target speech extraction system for compact 4-microphone arrays. It combines spatial cues with face-conditioned visual embeddings and outperforms classical beamformers on noisy mixtures.

Contribution

The paper introduces IsoNet, a multimodal neural network that improves target speech extraction on compact devices by using spatial and visual cues, and that outperforms classical beamforming methods.

Findings

01

IsoNet-CL1 achieves 9.31 dB SI-SDR on a hard test set spanning -1 to 10 dB SNR, with PESQ 2.13 and STOI 0.84.

02

Oracle delay-and-sum and MVDR beamformers degrade the test mixtures by 4.82 dB and 6.08 dB, whereas IsoNet-CL1 improves on the mixture by 4.85 dB.

03

Ablation studies confirm the benefits of visual conditioning and spatial features.
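
For readers unfamiliar with the metric behind the figures above, the following is a minimal sketch of the standard SI-SDR definition and of the improvement over the mixture (SI-SDRi). It is not taken from the paper: the function names `si_sdr` and `si_sdr_improvement`, the `eps` stabiliser, and the use of NumPy are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (standard definition)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so that a DC offset does not count as signal or distortion.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    distortion = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Under this definition, a positive SI-SDRi means the processed signal is closer to the clean target than the unprocessed mixture was, and a negative value means the processing degraded it.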

Abstract

Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target speech extraction system for a compact 4-microphone array. IsoNet combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision inside a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures with progressively difficult SNR regimes. On a hard test set spanning -1 to 10 dB SNR, IsoNet-CL1 achieves 9.31 dB SI-SDR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84. Oracle delay-and-sum and MVDR beamformers degrade the same mixtures by 4.82 dB and 6.08 dB…
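
The abstract names GCC-PHAT as the source of spatial cues. As a rough illustration of what such a feature contains, here is a minimal sketch of textbook GCC-PHAT between one microphone pair. It is not the authors' implementation: the function name `gcc_phat`, the `max_lag` parameter (in samples), the `eps` stabiliser, and the sign convention are all assumptions, and the paper's framing, pair selection, and lag range are not given in the abstract.

```python
import numpy as np


def gcc_phat(x, y, max_lag, eps=1e-12):
    """Textbook GCC-PHAT between two channels.

    Returns (lag, cc): lag is the peak position in samples (positive when
    x is delayed relative to y), and cc holds the correlation values for
    lags -max_lag .. +max_lag.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    n = len(x) + len(y)  # zero-pad so the correlation is linear, not circular
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    # PHAT weighting: keep only the phase of the cross-spectrum.
    cross = cross / (np.abs(cross) + eps)
    cc_full = np.fft.irfft(cross, n=n)
    max_lag = min(int(max_lag), n // 2)
    # Reorder so that index 0 corresponds to lag -max_lag.
    cc = np.concatenate((cc_full[n - max_lag:], cc_full[:max_lag + 1]))
    lag = int(np.argmax(cc)) - max_lag
    return lag, cc
```

With an aperture of only a few centimetres, the physically possible delay between two microphones is a handful of samples at typical speech sampling rates, so `max_lag` would be small and the resulting `cc` vector compact.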
