VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

Juan F. Montesinos; Venkatesh S. Kadandale; Gloria Haro

arXiv:2203.04099 · cs.SD · July 20, 2022 · 1 citation

TL;DR

VoViT is a low-latency audio-visual transformer for voice separation. It derives motion cues from face landmarks with a lightweight graph convolutional network and reports state-of-the-art results for both speech and singing voice.

Contribution

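The paper proposes a two-stage audio-visual model. In the first stage, a graph convolutional network extracts motion cues from face landmarks and an audio-visual transformer estimates the target voice; in the second stage, an audio-only network enhances the predominant voice. The paper also explores how models trained for speech separation transfer to singing voice separation.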

Findings

01 State-of-the-art performance in low-latency voice separation

02 Effective use of face landmarks for motion cues

03 Transferability from speech to singing voice separation

Abstract

This paper presents an audio-visual approach for voice separation which produces state-of-the-art results at a low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present different ablation studies and comparisons with state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing voice separation. The demos, code, and weights are available at https://ipcv.github.io/VoViT/
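To make the "graph convolutional network that processes face landmarks" idea concrete, here is a minimal sketch of a single generic graph convolution layer, ReLU(Â X W), where Â is the symmetrically normalized adjacency with self-loops. It is not the authors' architecture: the four-landmark ring graph, the coordinates, the weight matrix, and the function names are all hypothetical, chosen only to show how each landmark's features are mixed with those of its neighbours. The actual model is in the official repository.

```python
import math


def normalized_adjacency(num_nodes, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected graph, as nested lists."""
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        adj[i][i] = 1.0  # self-loop so each node keeps its own features
    for i, j in edges:
        adj[i][j] = 1.0
        adj[j][i] = 1.0
    degree = [sum(row) for row in adj]
    return [
        [adj[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


def matmul(x, y):
    """Plain nested-list matrix product."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def graph_conv(a_hat, features, weights):
    """One graph convolution layer: ReLU(a_hat @ features @ weights)."""
    out = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, value) for value in row] for row in out]


# Hypothetical toy graph: four landmarks connected in a ring.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
# Hypothetical (x, y) coordinates of the four landmarks in one video frame.
landmarks = [[0.0, 0.0], [1.0, 0.5], [2.0, 0.0], [1.0, -0.5]]
# Hypothetical weights mapping 2 input channels to 3 output channels;
# in a real network these would be learned.
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]

a_hat = normalized_adjacency(len(landmarks), edges)
motion_features = graph_conv(a_hat, landmarks, weights)
```

In a full model, a stack of such layers applied across frames would yield per-frame motion features, which the abstract describes as being fed, together with audio features, to the audio-visual transformer.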

Code & Models

Repositories

JuanFMontesinos/VoViT
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis