# Audio-visual source separation with localization and individual control

**Authors:** Mohanaprasad Kothandaraman, Balakrishnan Ramalingam, Kai Sheng, Aman Verma, Utkarsh Dhagat, Pranav Parab, Siddhartha Mallavolu, Sankar Ganesh

PMC · DOI: 10.1371/journal.pone.0321856 · PLOS One · 2025-05-23

## TL;DR

This paper introduces a system that separates and controls individual voices in video conferences using audio and visual cues.

## Contribution

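The paper proposes a pipeline that combines deep audio and visual feature extraction, audio-guided visual attention, background noise suppression, and DPRNN-TasNet voice separation to isolate individual speakers in video conferencing and control their volume.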

## Key findings

- The system achieves a test accuracy of 71.88% on both the AVE and Music 21 datasets.
- It effectively isolates and amplifies individual participants' speech in noisy environments.
- The framework includes a DPRNN-TasNet module for human voice separation.
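
The per-speaker volume control mentioned above can be illustrated with a small remixing step. This is a minimal sketch, not the authors' implementation: it assumes the separation stage has already produced one waveform per speaker (plain lists of floats in [-1, 1]), and the function name `remix` and the linear-gain convention are hypothetical choices made for illustration.

```python
def remix(sources, gains):
    """Mix separated per-speaker waveforms using one linear gain per speaker.

    sources: list of equal-length lists of float samples in [-1, 1]
             (assumed output of a separation model; hypothetical format).
    gains:   list of linear gains, e.g. 0.0 mutes a speaker, 2.0 doubles amplitude.
    """
    if len(sources) != len(gains):
        raise ValueError("need exactly one gain per separated source")
    if not sources:
        return []
    n = len(sources[0])
    if any(len(s) != n for s in sources):
        raise ValueError("all sources must have the same number of samples")

    # Weighted sum of the separated tracks, sample by sample.
    mixed = [sum(g * s[i] for s, g in zip(sources, gains)) for i in range(n)]

    # Hard-clip to [-1, 1] so boosted speakers do not overflow the output range.
    return [max(-1.0, min(1.0, x)) for x in mixed]


if __name__ == "__main__":
    speaker_a = [0.5, -0.5, 0.25]
    speaker_b = [0.25, 0.25, -0.5]
    # Boost speaker A and mute speaker B.
    print(remix([speaker_a, speaker_b], [2.0, 0.0]))
```

Hard clipping is the simplest way to keep the output in range; a real system would more likely use a limiter or normalize the gains to avoid audible distortion.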

## Abstract

The growing reliance on video conferencing software brings significant benefits but also introduces challenges, particularly in managing audio quality. In multi-participant settings, ambient noise and interruptions can hinder speaker recognition and disrupt the flow of conversation. This work proposes an audio-visual source separation pipeline designed specifically for video conferencing and telepresence robot applications. The framework aims to isolate and enhance the speech of individual participants in noisy environments while enabling control over the volume of specific individuals captured in the video frame. The proposed pipeline comprises four key components: a deep learning-based feature extractor for audio and video, an audio-guided visual attention mechanism, a module for background noise suppression and human voice separation, and Deep Multi-Resolution Network (DMRN) modules. For human voice separation, DPRNN-TasNet, a robust deep neural network framework, is employed. Experimental results demonstrate that the methodology effectively isolates and amplifies individual participants’ speech, achieving a test accuracy of 71.88% on both the AVE and Music 21 datasets.
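
To make the idea of audio-guided visual attention concrete, the sketch below weights the spatial regions of one video frame by how well each matches an audio feature vector. It is a simplified dot-product variant written for illustration only: the abstract does not specify the scoring function, so the unprojected dot product, the function name `audio_guided_attention`, and the assumption that audio and visual features share one dimension are all hypothetical.

```python
import math


def audio_guided_attention(audio_vec, region_feats):
    """Attend over visual regions using an audio feature as the query.

    audio_vec:    list of floats, one audio feature vector for the frame.
    region_feats: list of visual feature vectors, one per spatial region,
                  each with the same length as audio_vec (an assumption).
    Returns (weights, attended), where weights sum to 1 over regions and
    attended is the weighted average of the region features.
    """
    # Relevance score of each region: dot product with the audio feature.
    scores = [sum(a * r for a, r in zip(audio_vec, region))
              for region in region_feats]

    # Numerically stable softmax over regions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-pooled visual feature for the frame.
    dim = len(region_feats[0])
    attended = [sum(w * region[d] for w, region in zip(weights, region_feats))
                for d in range(dim)]
    return weights, attended


if __name__ == "__main__":
    audio = [1.0, 0.0]
    regions = [[5.0, 0.0], [0.0, 5.0]]  # region 0 aligns with the audio feature
    w, v = audio_guided_attention(audio, regions)
    print(w, v)
```

In a trained model the scores would typically come from learned projections of both modalities rather than a raw dot product; the softmax-weighted pooling step is the part this sketch is meant to show.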

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12101657/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12101657/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/PMC12101657/full.md

---
Source: https://tomesphere.com/paper/PMC12101657