DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors
Federico Landini, Mireia Diez, Themos Stafylakis, Lukáš Burget

TL;DR
DiaPer replaces the encoder-decoder attractor (EDA) module of EEND-EDA with a Perceiver-based attractor module, yielding better performance on Callhome, more accurate speaker count estimation, and faster inference than EEND-EDA.
Contribution
The paper presents a Perceiver-based attractor module for end-to-end neural diarization that takes the place of the EDA module in EEND-EDA, improving performance and inference speed while keeping the model lightweight.
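To make the idea of a Perceiver-style attractor module concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a single cross-attention step in which a fixed set of latent vectors attends over frame embeddings, and it assumes the attended latents are used directly as attractors whose dot products with the frames give per-frame speaker activities (the scoring used in EEND-style models). The function names, dimensions, and random weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, frames, w_q, w_k, w_v):
    """One single-head cross-attention step: latents query the frames.

    latents: (S, D) fixed-size set of vectors (hypothetical, random here)
    frames:  (T, D) frame embeddings of a recording with T frames
    """
    q = latents @ w_q                          # (S, D)
    k = frames @ w_k                           # (T, D)
    v = frames @ w_v                           # (T, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (S, T)
    # Residual connection: each latent is updated with a summary of the frames.
    return latents + softmax(scores, axis=-1) @ v

def speaker_activities(frames, attractors):
    # Assumption: per-frame activity is sigmoid(frame . attractor).
    return 1.0 / (1.0 + np.exp(-(frames @ attractors.T)))   # (T, S)

rng = np.random.default_rng(0)
T, D, S = 200, 16, 4                           # illustrative sizes only
frames = rng.standard_normal((T, D))
latents = rng.standard_normal((S, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

attractors = cross_attend(latents, frames, w_q, w_k, w_v)
activities = speaker_activities(frames, attractors)
print(attractors.shape, activities.shape)
```

In this sketch the number of attractors is set by the number of latents rather than by the recording length, so the same module handles any number of frames T. How DiaPer actually maps latents to attractors and decides which attractors are valid is described in the paper and is not reproduced here.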
Findings
Better performance than EEND-EDA on the Callhome dataset
More accurate estimation of the number of speakers in a conversation
Faster inference than EEND-EDA
Abstract
Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA; namely obtaining better performance on the largely studied Callhome dataset, finding the quantity of speakers in a conversation more accurately, and faster inference time. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. Besides, we perform comparisons with other works and a cascaded baseline across more than ten public wide-band datasets. Together with…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
