The FlySpeech Audio-Visual Speaker Diarization System for MISP Challenge 2022
Li Zhang, Huan Zhao, Yue Li, Bowen Pang, Yannan Wang, Hongji Wang, Wei Rao, Qing Wang, Lei Xie

TL;DR
This paper presents FlySpeech, an end-to-end audio-visual speaker diarization system for the MISP Challenge 2022. It jointly trains the speaker encoder and the audio-visual decoder, and initializes the speaker encoder from a speaker extractor pretrained on large-scale data.
Contribution
The main contribution is an audio-visual diarization system whose speaker encoder and audio-visual decoder are trained jointly rather than separately, with the speaker encoder initialized from a large-data pretrained speaker extractor.
Findings
Achieved improved diarization accuracy in the MISP Challenge
Demonstrated effectiveness of joint training for audio-visual models
Leveraged large-data pretrained speaker extractors for initialization
Abstract
This paper describes the FlySpeech speaker diarization system submitted to the second Multimodal Information Based Speech Processing (MISP) Challenge held in ICASSP 2022. We develop an end-to-end audio-visual speaker diarization (AVSD) system, which consists of a lip encoder, a speaker encoder, and an audio-visual decoder. Specifically, to mitigate the degradation of diarization performance caused by separate training, we jointly train the speaker encoder and the audio-visual decoder. In addition, we leverage the large-data pretrained speaker extractor to initialize the speaker encoder.
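The difference between separate and joint training described in the abstract can be illustrated with a toy sketch. This is not the authors' model: each component is reduced to a single scalar weight, and the names `ToyAVSD`, `train_step`, `w_lip`, `w_spk`, and `w_dec` are hypothetical. It only shows which parameters receive gradient updates in each regime, assuming a binary "is this speaker active" target and a cross-entropy loss.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyAVSD:
    """Toy stand-in for the three components; each is one scalar weight."""
    w_lip: float = 0.5  # lip encoder (kept frozen in this sketch)
    w_spk: float = 0.1  # speaker encoder (imagine it came from a pretrained extractor)
    w_dec: float = 0.2  # audio-visual decoder

    def forward(self, audio: float, lip: float) -> float:
        spk_emb = self.w_spk * audio
        lip_emb = self.w_lip * lip
        logit = self.w_dec * (spk_emb + lip_emb)
        return 1.0 / (1.0 + math.exp(-logit))  # probability the speaker is active


def train_step(model: ToyAVSD, audio: float, lip: float, label: float,
               lr: float = 0.1, joint: bool = True) -> float:
    """One gradient step on binary cross-entropy; returns the loss before the update."""
    spk_emb = model.w_spk * audio
    lip_emb = model.w_lip * lip
    prob = model.forward(audio, lip)
    loss = -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))

    d_logit = prob - label                     # dLoss/dlogit for sigmoid + BCE
    grad_dec = d_logit * (spk_emb + lip_emb)   # gradient for the decoder weight
    grad_spk = d_logit * model.w_dec * audio   # gradient reaching the speaker encoder

    model.w_dec -= lr * grad_dec
    if joint:
        # Joint training: the diarization loss also updates the speaker encoder.
        model.w_spk -= lr * grad_spk
    return loss


separate, joint = ToyAVSD(), ToyAVSD()
train_step(separate, audio=1.0, lip=1.0, label=1.0, joint=False)
train_step(joint, audio=1.0, lip=1.0, label=1.0, joint=True)
print(separate.w_spk, joint.w_spk)
```

With `joint=False` the speaker encoder weight stays at its initial value, while with `joint=True` it moves in the direction that reduces the diarization loss, which is the mechanism the abstract credits for avoiding the degradation caused by separate training.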
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
