Audiovisual Singing Voice Separation
Bochen Li, Yuxuan Wang, and Zhiyao Duan

TL;DR
This paper introduces an audiovisual singing voice separation method that uses video of the singer's mouth movement and, during training, adds extra vocal signals unrelated to that mouth movement. It improves separation quality over audio-only methods, especially when backing vocals are present.
Contribution
It presents a new audiovisual approach that incorporates mouth movement and auxiliary vocal signals, and it contributes two new datasets, one for training and one for evaluation.
Findings
Outperforms audio-only methods in separation quality.
Particularly effective with backing vocals present.
Demonstrates the benefit of visual cues and auxiliary signals in training.
Abstract
Separating a song into vocal and accompaniment components is an active research topic, and recent years witnessed an increased performance from supervised training using deep learning techniques. We propose to apply the visual information corresponding to the singers' vocal activities to further improve the quality of the separated vocal signals. The video frontend model takes the input of mouth movement and fuses it into the feature embeddings of an audio-based separation framework. To facilitate the network to learn audiovisual correlation of singing activities, we add extra vocal signals irrelevant to the mouth movement to the audio mixture during training. We create two audiovisual singing performance datasets for training and evaluation, respectively, one curated from audition recordings on the Internet, and the other recorded in house. The proposed method outperforms audio-based…
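The training trick described in the abstract (adding vocal signals unrelated to the visible mouth movement to the mixture) can be illustrated with a minimal sketch. This is not the authors' code: the function name `make_training_example`, the `extra_gain` parameter, and the use of random noise as stand-in audio are all assumptions made for illustration. The idea it shows is that the interfering voice goes into the network input but not into the target, so only the visual stream can tell the network which voice to keep.

```python
import numpy as np


def make_training_example(vocal, accompaniment, extra_vocal, extra_gain=1.0):
    """Build one (mixture, target) pair for audiovisual separation training.

    vocal         : waveform of the singer whose mouth is visible in the video
    accompaniment : instrumental waveform
    extra_vocal   : waveform of another singer, unrelated to the video
    extra_gain    : hypothetical scaling factor for the extra vocal (assumption)
    """
    # Trim all signals to a common length so they can be summed sample by sample.
    n = min(len(vocal), len(accompaniment), len(extra_vocal))
    vocal, accompaniment, extra_vocal = vocal[:n], accompaniment[:n], extra_vocal[:n]

    # The extra vocal is part of the input mixture only.
    mixture = vocal + accompaniment + extra_gain * extra_vocal
    # The target is the on-screen singer alone.
    target = vocal
    return mixture, target


# Stand-in signals: one second of noise at 16 kHz (placeholder for real audio).
rng = np.random.default_rng(0)
vocal = rng.standard_normal(16000)
accompaniment = rng.standard_normal(16000)
extra_vocal = rng.standard_normal(12000)  # deliberately shorter

mixture, target = make_training_example(vocal, accompaniment, extra_vocal, extra_gain=0.5)
print(mixture.shape, target.shape)
```

How the extra vocals are selected and scaled in the actual method is not stated in the text above, so those details are left as parameters here.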
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies
