Online Audio-Visual Autoregressive Speaker Extraction

Zexu Pan; Wupeng Wang; Shengkui Zhao; Chong Zhang; Kun Zhou; Yukun Ma; Bin Ma

arXiv:2506.01270 · eess.AS · June 3, 2025

Open Access

TL;DR

This paper introduces a lightweight online audio-visual speaker extraction model that effectively utilizes visual cues and autoregressive acoustic encoding, demonstrating robustness and improved performance in streaming scenarios.

Contribution

It presents a lightweight visual frontend and a lightweight autoregressive acoustic encoder for online speaker extraction, and studies how the model behaves when the focus of attention (the target speaker) changes.

Findings

01. The visual frontend performs comparably to the previous state of the art with only 0.1 million network parameters.

02. The autoregressive acoustic encoder improves SI-SNRi by 0.9 dB.

03. The model remains robust when the target speaker (the focus of attention) changes.
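
For readers unfamiliar with the metric in the second finding, SI-SNRi is the improvement in scale-invariant signal-to-noise ratio of the extracted speech over the unprocessed mixture, both measured against the clean target. The following is a minimal pure-Python sketch of the commonly used definition, not the authors' evaluation code; the function names `si_snr` and `si_snr_improvement`, the `eps` constant, and the toy signals are assumptions made for illustration.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    n = len(target)
    # Remove the mean of each signal before comparing them.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target; the residual counts as noise.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    projection = [scale * b for b in t]
    noise = [a - p for a, p in zip(e, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Toy example (hypothetical data): an interferer orthogonal to the target.
target = [1.0, -1.0, 1.0, -1.0]
interferer = [1.0, 1.0, -1.0, -1.0]
mixture = [t + i for t, i in zip(target, interferer)]
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]
improvement_db = si_snr_improvement(estimate, mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate by a constant leaves the score unchanged, which is why the metric suits separation models whose output gain is arbitrary.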

Abstract

This paper proposes a novel online audio-visual speaker extraction model. In the streaming regime, most studies optimize the audio network only, leaving the visual frontend less explored. We first propose a lightweight visual frontend based on depth-wise separable convolution. Then, we propose a lightweight autoregressive acoustic encoder to serve as the second cue, to actively explore the information in the separated speech signal from past steps. Scenario-wise, for the first time, we study how the algorithm performs when there is a change in the focus of attention, i.e., the target speaker. Experimental results on the LRS3 dataset show that our visual frontend performs comparably to the previous state-of-the-art on both SkiM and ConvTasNet audio backbones with only 0.1 million network parameters and 2.1 MACs per second of processing. The autoregressive acoustic encoder provides an additional…
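
To illustrate why a frontend built on depth-wise separable convolution can stay small, the sketch below compares the parameter count of a standard convolution with that of a depth-wise convolution followed by a 1x1 point-wise convolution. It is a back-of-the-envelope calculation under assumed layer sizes, not the paper's architecture: the function names and the 64-to-128-channel, 3x3 example are hypothetical.

```python
from math import prod


def standard_conv_params(c_in, c_out, kernel_size, bias=True):
    """Parameters of a standard convolution: every output channel sees every input channel."""
    return c_in * c_out * prod(kernel_size) + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, kernel_size, bias=True):
    """Parameters of a depth-wise conv (one filter per input channel) plus a 1x1 point-wise conv."""
    depthwise = c_in * prod(kernel_size) + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 128 output channels, 3x3 kernel, no bias.
standard = standard_conv_params(64, 128, (3, 3), bias=False)
separable = depthwise_separable_params(64, 128, (3, 3), bias=False)
reduction = standard / separable
```

For this assumed layer the separable form needs roughly an eighth of the weights, since the spatial filtering and the channel mixing are factored into two cheaper steps.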

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Convolutional time-domain audio separation network · Focus