IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments
Dinanath Padhya, Sajen Maharjan, Binita Adhikari, and Ishwor Raj Pokharel

TL;DR
IsoNet is an audio-visual target speech extraction system for a compact 4-microphone array. It combines spatial cues with visual information and outperforms classical beamformers in noisy conditions.
Contribution
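The paper introduces IsoNet, a U-Net mask estimation network that combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision to extract a user-selected speaker on compact devices, surpassing classical beamforming methods.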
Findings
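IsoNet-CL1 achieves 9.31 dB SI-SDR on a hard test set spanning -1 to 10 dB SNR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84.
Oracle delay-and-sum and MVDR beamformers degrade the same mixtures by 4.82 dB and 6.08 dB respectively, rather than improving them.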
Ablation studies confirm the benefits of visual conditioning and spatial features.
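The figures above are reported as SI-SDR and SI-SDR improvement (SI-SDRi). As a reading aid, here is a minimal NumPy sketch of the usual definition of these metrics. It is not the authors' evaluation code; the function names `si_sdr` and `si_sdr_improvement` and the `eps` stabiliser are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch)."""
    # Remove DC offsets so the projection below is not biased by a mean shift.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; alpha is the optimal scaling factor.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    # Everything not explained by the scaled target counts as distortion.
    distortion = estimate - projection
    ratio = (np.sum(projection ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

Under this definition a negative SI-SDRi means the processed signal is further from the target than the unprocessed mixture was, which is how the beamformer results should be read.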
Abstract
Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target speech extraction system for a compact 4-microphone array. IsoNet combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision inside a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures with progressively difficult SNR regimes. On a hard test set spanning -1 to 10 dB SNR, IsoNet-CL1 achieves 9.31 dB SI-SDR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84. Oracle delay-and-sum and MVDR beamformers degrade the same mixtures by 4.82 dB and 6.08 dB…
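The abstract lists GCC-PHAT among the spatial cues. For readers unfamiliar with it, the sketch below shows the textbook generalized cross-correlation with phase transform between two microphone channels in NumPy. It is a hedged illustration only: the function name `gcc_phat`, its parameters, and the lag-window handling are assumptions, and the paper's actual feature extraction is not described in this summary.

```python
import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None, eps=1e-12):
    """Estimate the delay of `sig` relative to `ref` with GCC-PHAT (sketch).

    Returns (delay_in_seconds, correlation_over_lags). A positive delay
    means `sig` arrives later than `ref`.
    """
    # Zero-pad so the circular correlation behaves like a linear one.
    n = len(sig) + len(ref)
    sig_fft = np.fft.rfft(sig, n=n)
    ref_fft = np.fft.rfft(ref, n=n)
    # Cross-power spectrum, whitened so only phase information remains.
    cross = sig_fft * np.conj(ref_fft)
    cross = cross / (np.abs(cross) + eps)
    cc = np.fft.irfft(cross, n=n)
    # Restrict the search to physically plausible lags for a small array.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs, cc
```

With an aperture of only a few centimetres the true inter-microphone delay is a handful of samples at typical speech sampling rates, so `max_tau` would be set to the array spacing divided by the speed of sound.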
