DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors

Federico Landini; Mireia Diez; Themos Stafylakis; Lukáš Burget

arXiv:2312.04324 · eess.AS · June 4, 2024 · 1 citation

TL;DR

DiaPer replaces the encoder-decoder attractor (EDA) module of EEND-EDA with a Perceiver-based one, achieving better performance on Callhome, more accurate speaker count estimation, and faster inference than EEND-EDA.

Contribution

The paper presents a novel Perceiver-based attractor module for end-to-end neural diarization, enhancing performance and efficiency compared to prior models.

Findings

01 Better performance than EEND-EDA on the Callhome dataset
02 More accurate estimation of the number of speakers in a conversation
03 Faster inference than EEND-EDA

Abstract

Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA; namely obtaining better performance on the largely studied Callhome dataset, finding the quantity of speakers in a conversation more accurately, and faster inference time. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. Besides, we perform comparisons with other works and a cascaded baseline across more than ten public wide-band datasets. Together with…
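As a rough illustration of the idea named in the abstract, the sketch below shows how a fixed set of latent vectors can cross-attend to frame embeddings to produce attractors, Perceiver-style, and how per-frame speaker activities can then be read off as the sigmoid of frame-attractor dot products (as in EEND-style attractor models). It is a minimal sketch under stated assumptions, not the authors' implementation: the sizes, the random inputs, the single attention step, and the absence of learned projections are all simplifications, and the function names are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(latents, frames):
    """One cross-attention step: each latent (query) attends over all frames.

    Assumption: keys and values are the raw frame embeddings; a real model
    would apply learned query/key/value projections and stack several blocks.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in latents:
        weights = softmax([dot(q, f) * scale for f in frames])
        # Attention output: weighted average of the frame embeddings.
        updated.append(
            [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
        )
    return updated


def speaker_activities(frames, attractors):
    """Per-frame, per-attractor activity: sigmoid of the dot product."""
    return [
        [1.0 / (1.0 + math.exp(-dot(f, a))) for a in attractors]
        for f in frames
    ]


random.seed(0)
T, D, S = 6, 4, 3  # frames, embedding size, number of latents (hypothetical)
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
latents = [[random.gauss(0, 1) for _ in range(D)] for _ in range(S)]

attractors = cross_attend(latents, frames)            # S vectors of size D
activities = speaker_activities(frames, attractors)   # T x S values in [0, 1]
```

Because the number of latents is fixed in advance, all attractors in this sketch are computed in one pass over the frames rather than decoded one at a time; how the actual model decides which attractors correspond to real speakers is not covered here.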

Code & Models

Repositories

butspeechfit/diaper (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing