The FlySpeech Audio-Visual Speaker Diarization System for MISP Challenge 2022

Li Zhang; Huan Zhao; Yue Li; Bowen Pang; Yannan Wang; Hongji Wang; Wei Rao; Qing Wang; Lei Xie

arXiv:2307.15400 · cs.SD · July 31, 2023 · 1 citation

TL;DR

This paper presents FlySpeech, an end-to-end audio-visual speaker diarization system submitted to the MISP Challenge 2022. It jointly trains the speaker encoder and the audio-visual decoder, and initializes the speaker encoder from a speaker extractor pretrained on large-scale data.

Contribution

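An end-to-end audio-visual diarization system (lip encoder, speaker encoder, audio-visual decoder) in which the speaker encoder and decoder are trained jointly rather than separately, with the speaker encoder initialized from a large-data pretrained speaker extractor.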

Findings

01

Achieved improved diarization accuracy in the MISP Challenge

02

Jointly trained the speaker encoder and the audio-visual decoder to mitigate the diarization degradation caused by separate training

03

Initialized the speaker encoder from a speaker extractor pretrained on large-scale data

Abstract

This paper describes the FlySpeech speaker diarization system submitted to the second Multimodal Information Based Speech Processing (MISP) Challenge held in ICASSP 2022. We develop an end-to-end audio-visual speaker diarization (AVSD) system, which consists of a lip encoder, a speaker encoder, and an audio-visual decoder. Specifically, to mitigate the degradation of diarization performance caused by separate training, we jointly train the speaker encoder and the audio-visual decoder. In addition, we leverage the large-data pretrained speaker extractor to initialize the speaker encoder.
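
To make the three-component layout in the abstract concrete, here is a minimal NumPy sketch of how a lip encoder, a speaker encoder, and an audio-visual decoder might fit together at inference time. It is an illustration under stated assumptions, not the authors' implementation: the class name `ToyAVSD`, all layer sizes, the random single-matrix "layers", the mean-pooled speaker embedding, and the concatenation-based fusion are hypothetical. The `trainable` helper only mirrors the abstract's distinction between joint and separate training; how the lip encoder is trained is not stated in the abstract, so the sketch leaves it out.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear(in_dim, out_dim):
    """Random weight matrix standing in for a learned layer (hypothetical)."""
    return rng.standard_normal((in_dim, out_dim)) * 0.1


class ToyAVSD:
    """Toy forward pass: lip encoder + speaker encoder + audio-visual decoder."""

    def __init__(self, lip_dim=32, audio_dim=40, emb_dim=16, hid=24):
        self.w_lip = linear(lip_dim, hid)                   # lip encoder
        self.w_spk = linear(audio_dim, emb_dim)             # speaker encoder
        self.w_dec = linear(hid + audio_dim + emb_dim, 1)   # audio-visual decoder

    def forward(self, lips, audio, enroll):
        # lips:   (S, T, lip_dim)   per-speaker lip features over T frames
        # audio:  (T, audio_dim)    frame-level features of the recording
        # enroll: (S, N, audio_dim) per-speaker audio used for the embedding
        S, T, _ = lips.shape
        lip_emb = np.tanh(lips @ self.w_lip)                # (S, T, hid)
        spk_emb = (enroll @ self.w_spk).mean(axis=1)        # (S, emb_dim)
        audio_b = np.broadcast_to(audio, (S, T, audio.shape[-1]))
        spk_b = np.broadcast_to(spk_emb[:, None, :], (S, T, spk_emb.shape[-1]))
        fused = np.concatenate([lip_emb, audio_b, spk_b], axis=-1)
        logits = (fused @ self.w_dec)[..., 0]               # (S, T)
        return 1.0 / (1.0 + np.exp(-logits))                # speech probability

    @staticmethod
    def trainable(joint=True):
        """Modules that receive gradient updates (assumed: frozen if not joint)."""
        if joint:
            return ["speaker_encoder", "audio_visual_decoder"]
        return ["audio_visual_decoder"]


model = ToyAVSD()
probs = model.forward(
    rng.standard_normal((3, 50, 32)),   # 3 speakers, 50 frames of lip features
    rng.standard_normal((50, 40)),      # 50 frames of audio features
    rng.standard_normal((3, 10, 40)),   # 10 enrollment frames per speaker
)
print(probs.shape)  # one speech-activity probability per speaker per frame
```

In the joint setting, the diarization loss would update the speaker encoder as well as the decoder, so the embeddings can adapt to the diarization task instead of staying fixed at their pretrained values.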

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques