TL;DR
DOVER-Lap is an ensemble method that combines the outputs of multiple speaker diarization systems, including outputs that contain overlapping speech. Combining diverse systems improves accuracy, and the method can also be used for late fusion in multichannel scenarios.
Contribution
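It introduces an algorithm that extends the DOVER framework to diarization outputs with overlapping segments. Speaker labels are mapped across systems with an approximate weighted k-partite graph matching over a global cost tensor, replacing DOVER's pair-wise incremental mapping, and the mapped outputs are then combined by majority voting.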
Findings
DOVER-Lap outperforms the best single system on the AMI and LibriCSS datasets.
It effectively combines diverse diarization systems: clustering-based, region proposal network (RPN), and target-speaker voice activity detection (TS-VAD).
The method is also effective for late fusion in multichannel diarization.
Abstract
Several advances have been made recently towards handling overlapping speech for speaker diarization. Since speech and natural language tasks often benefit from ensemble techniques, we propose an algorithm for combining outputs from such diarization systems through majority voting. Our method, DOVER-Lap, is inspired by the recently proposed DOVER algorithm, but is designed to handle overlapping segments in diarization outputs. We also modify the pair-wise incremental label mapping strategy used in DOVER, and propose an approximation algorithm based on weighted k-partite graph matching, which performs this mapping using a global cost tensor. We demonstrate the strength of our method by combining outputs from diverse systems -- clustering-based, region proposal networks, and target-speaker voice activity detection -- on AMI and LibriCSS datasets, where it consistently outperforms the…
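To make the voting stage concrete, here is a minimal sketch of majority voting that tolerates overlapping speech. It rests on several assumptions that are not taken from the abstract: speaker labels have already been mapped into a common label space (the graph-matching step is not shown), all systems are weighted equally, and the number of speakers kept in each region is the rounded mean of the active-speaker counts across hypotheses. The function name `vote_with_overlap` and the segment format `(start, end, speaker)` are hypothetical.

```python
from collections import Counter


def vote_with_overlap(hypotheses):
    """Fuse diarization hypotheses whose labels are ALREADY mapped (assumption).

    hypotheses: list of hypotheses, each a list of (start, end, speaker) tuples.
    Returns a list of (start, end, speaker) tuples, one per winning speaker
    per region; adjacent regions are not merged in this sketch.
    """
    # Between two neighbouring boundaries, every hypothesis is constant.
    bounds = sorted({t for hyp in hypotheses for (s, e, _) in hyp for t in (s, e)})
    fused = []
    for lo, hi in zip(bounds, bounds[1:]):
        votes = Counter()
        active_counts = []
        for hyp in hypotheses:
            # Speakers this hypothesis marks as active over the whole region.
            active = {spk for (s, e, spk) in hyp if s <= lo and hi <= e}
            votes.update(active)
            active_counts.append(len(active))
        # Assumed rule: keep as many speakers as the hypotheses report on
        # average (rounded half up), so overlapped regions keep >1 speaker.
        k = int(sum(active_counts) / len(hypotheses) + 0.5)
        # Most-voted speakers first; ties broken by label for determinism.
        winners = sorted(votes, key=lambda spk: (-votes[spk], spk))[:k]
        fused.extend((lo, hi, spk) for spk in winners)
    return fused


# Three hypothetical system outputs; two of them report overlap in [4, 5].
hyps = [
    [(0.0, 5.0, "A"), (4.0, 8.0, "B")],
    [(0.0, 4.0, "A"), (4.0, 8.0, "B")],
    [(0.0, 5.0, "A"), (4.0, 8.0, "B")],
]
fused = vote_with_overlap(hyps)
```

In this example the region from 4.0 to 5.0 keeps both speakers, because two of the three hypotheses mark it as overlapped; a vote that allows only one speaker per region would discard one of them.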
