Cross-attention and Self-attention for Audio-visual Speaker Diarization in MISP-Meeting Challenge

Zhaoyang Li; Haodong Zhou; Longjie Luo; Xiaoxiao Li; Yongxin Chen; Lin Li; Qingyang Hong

arXiv:2506.02621 · cs.SD · June 4, 2025

TL;DR

This paper introduces CASA-Net, an embedding fusion method for end-to-end audio-visual speaker diarization that combines cross-attention and self-attention modules. On the evaluation set of Task 1 of the MISP 2025 Challenge, it achieves a diarization error rate (DER) of 8.18%, a 47.3% reduction relative to the baseline.

Contribution

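The paper presents CASA-Net, which integrates cross-attention and self-attention modules, together with a training strategy based on pseudo-label refinement and retraining and a post-processing stage (median filtering and overlap averaging), to improve audio-visual speaker diarization accuracy.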
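To make the fusion idea concrete, here is a minimal NumPy sketch of cross-attention followed by self-attention. It is an illustration under assumptions, not the CASA-Net architecture: the function names (`attention`, `fuse`), the choice of audio frames as queries and visual frames as keys and values, the residual connections, the embedding sizes, and the absence of learned projection weights are all hypothetical simplifications that the source does not specify.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ value

def fuse(audio, visual):
    """Hypothetical fusion: cross-attention, then self-attention."""
    # Cross-attention: each audio frame attends over all visual frames
    # (query/key roles and the residual connection are assumptions).
    cross = audio + attention(audio, visual, visual)
    # Self-attention: fused frames attend to one another for temporal context.
    return cross + attention(cross, cross, cross)

rng = np.random.default_rng(0)
audio = rng.standard_normal((50, 16))   # 50 audio frames, 16-dim embeddings
visual = rng.standard_normal((25, 16))  # 25 visual frames, 16-dim embeddings
fused = fuse(audio, visual)
print(fused.shape)  # (50, 16)
```

The output keeps the audio frame rate because the audio embeddings act as queries; a real system would add learned query, key, and value projections and multiple heads.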

Findings

01  Achieved a diarization error rate (DER) of 8.18% on the evaluation set

02  Reduced DER by 47.3% relative to the baseline

03  Improved timestamp prediction accuracy through pseudo-label refinement and retraining
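The two DER figures above can be related with a short calculation. The sketch below assumes the 47.3% figure is a relative reduction, that is, (baseline − system) / baseline. The baseline value it prints is derived from that assumption and is not quoted from this page; the helper names are hypothetical.

```python
def relative_reduction(baseline, system):
    """Relative DER reduction: (baseline - system) / baseline."""
    return (baseline - system) / baseline

def implied_baseline(system, reduction):
    """Invert the formula above to recover the baseline DER."""
    return system / (1.0 - reduction)

# System DER of 8.18% and a 47.3% reduction, assumed to be relative.
baseline = implied_baseline(8.18, 0.473)
print(round(baseline, 2))  # 15.52
```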

Abstract

This paper presents the system developed for Task 1 of the Multi-modal Information-based Speech Processing (MISP) 2025 Challenge. We introduce CASA-Net, an embedding fusion method designed for end-to-end audio-visual speaker diarization (AVSD) systems. CASA-Net incorporates a cross-attention (CA) module to effectively capture cross-modal interactions in audio-visual signals and employs a self-attention (SA) module to learn contextual relationships among audio-visual frames. To further enhance performance, we adopt a training strategy that integrates pseudo-label refinement and retraining, improving the accuracy of timestamp predictions. Additionally, median filtering and overlap averaging are applied as post-processing techniques to eliminate outliers and smooth prediction labels. Our system achieved a diarization error rate (DER) of 8.18% on the evaluation set, representing a relative…
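The post-processing step named in the abstract can be sketched in plain Python. This is an illustrative sketch under assumptions, not the authors' implementation: the chunk length, hop size, kernel width, 0.5 decision threshold, edge padding, and the order of the two operations (averaging first, then median filtering) are all hypothetical choices, as are the function names.

```python
from statistics import median

def overlap_average(chunks, hop, total):
    """Average per-frame scores from overlapping chunks.

    Chunk k is assumed to start at frame k * hop.
    """
    sums = [0.0] * total
    counts = [0] * total
    for k, chunk in enumerate(chunks):
        for j, p in enumerate(chunk):
            t = k * hop + j
            if t < total:
                sums[t] += p
                counts[t] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def median_filter(probs, kernel=3):
    """Sliding-window median; the endpoints are replicated as padding."""
    half = kernel // 2
    padded = [probs[0]] * half + list(probs) + [probs[-1]] * half
    return [median(padded[i:i + kernel]) for i in range(len(probs))]

# Two 4-frame chunks of speech probabilities for one speaker, hop of 2 frames.
chunks = [[0.9, 0.8, 0.1, 0.9], [0.3, 0.9, 0.8, 0.2]]
avg = overlap_average(chunks, hop=2, total=6)
smoothed = median_filter(avg, kernel=3)
labels = [int(p > 0.5) for p in smoothed]
print(labels)  # [1, 1, 1, 1, 1, 0]
```

In this toy input, frame 2 has a low score in both chunks (0.1 and 0.3) while its neighbours are high. Averaging alone would leave a one-frame gap in the speech segment, and the median filter removes it as an outlier.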

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing