Dual-Stream Decoupled Learning for Temporal Consistency and Speaker Interaction in AVSD

Junhao Xiao; Shun Feng; Zhiyu Wu; Jinghan Yu; Haibiao Yao; Zhiyuan Ma; Jianjun Li; Youjun Bao; Yi Chen

arXiv:2512.19130·cs.MM·April 17, 2026

TL;DR

This paper introduces D$^{2}$Stream, a dual-stream framework that decouples temporal and social cues for audio-visual speaker detection, achieving state-of-the-art results (95.6% mAP on AVA-ActiveSpeaker).

Contribution

The paper proposes a novel decoupled dual-stream architecture that isolates temporal and social features, addressing conflicting biases in AVSD modeling.

Findings

01. Achieves 95.6% mAP on AVA-ActiveSpeaker

02. Demonstrates effective task decoupling through gradient analysis

03. Outperforms previous methods on Columbia ASD

Abstract

Audio-Visual Speaker Detection (AVSD) hinges on modeling both individual temporal continuity and inter-personal social context. Existing coupled architectures struggle to reconcile these tasks in shared representation spaces due to conflicting inductive biases: temporal modeling favors low-frequency smoothness, while inter-personal interaction requires high-frequency discriminability. We propose D$^{2}$Stream, a decoupled dual-stream framework that explicitly isolates these functionalities into parallel, task-specific branches. Specifically, the Intra-speaker Temporal Continuity (ITC) stream captures longitudinal stability, whereas the Inter-personal Social Relation (ISR) stream models transversal social cues. Quantitative gradient analysis reveals an evolutionary divergence in update directions, stabilizing at 86.1$^{\circ}$, which confirms the inherent task conflict and the effectiveness of…
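
The abstract reports an angle between update directions but, as excerpted here, does not say how it is measured. One common way to quantify such a divergence is the angle between the gradients of two objectives, flattened into vectors. The sketch below assumes that reading; the function name `gradient_angle_deg` and the toy vectors are hypothetical and are not taken from the paper.

```python
import math

def gradient_angle_deg(g_a, g_b):
    """Angle in degrees between two flattened gradient vectors.

    Assumption: g_a and g_b are gradients of two different objectives
    (e.g. a temporal one and a social one), flattened to equal length.
    """
    if len(g_a) != len(g_b):
        raise ValueError("gradient vectors must have the same length")
    dot = sum(x * y for x, y in zip(g_a, g_b))
    norm_a = math.sqrt(sum(x * x for x in g_a))
    norm_b = math.sqrt(sum(y * y for y in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("angle is undefined for a zero gradient")
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cos_sim = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_sim))

# Toy vectors (hypothetical values, not the paper's data).
g_temporal = [0.9, 0.1, 0.0]
g_social = [0.0, 0.2, 0.8]
print(f"angle between update directions: {gradient_angle_deg(g_temporal, g_social):.1f} deg")
```

Under this reading, an angle near 0 degrees means the two objectives push the parameters the same way, an angle near 90 degrees means the updates are close to orthogonal, and an angle above 90 degrees means they partly oppose each other.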
