MOVER: Combining Multiple Meeting Recognition Systems
Naoyuki Kamo, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani

TL;DR
MOVER is a system combination method that merges the outputs of meeting recognition systems that differ in both diarization and ASR, lowering tcpWER on the CHiME-8 DASR and NOTSOFAR-1 tasks.
Contribution
It introduces the first method capable of merging outputs from meeting recognition systems with different diarization and ASR components.
Findings
Achieves a 9.55% relative tcpWER reduction on the CHiME-8 DASR task.
Achieves an 8.51% relative tcpWER reduction on the multi-channel track of the NOTSOFAR-1 task.
Demonstrates successful integration of diverse system outputs for improved recognition accuracy.
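The findings above are stated as relative reductions, so a short sketch of the arithmetic may help. The function name and the input values below are hypothetical and chosen only for illustration; the absolute tcpWER values are not given on this page.

```python
def relative_reduction(baseline_wer: float, combined_wer: float) -> float:
    """Return the relative error reduction, in percent, of combined_wer
    with respect to baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - combined_wer) / baseline_wer


# Hypothetical values (NOT from the paper): a baseline error rate of 20.0%
# that drops to 18.0% after system combination.
example = relative_reduction(20.0, 18.0)
print(f"relative reduction: {example:.2f}%")
```

A relative reduction expresses the gain as a fraction of the baseline error rather than as an absolute difference, so the same percentage can correspond to different absolute changes on different tasks.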
Abstract
In this paper, we propose Meeting recognizer Output Voting Error Reduction (MOVER), a novel system combination method for meeting recognition tasks. Although there are methods to combine the output of diarization (e.g., DOVER) or automatic speech recognition (ASR) systems (e.g., ROVER), MOVER is the first approach that can combine the outputs of meeting recognition systems that differ in terms of both diarization and ASR. MOVER combines hypotheses with different time intervals and speaker labels through a five-stage process that includes speaker alignment, segment grouping, word and timing combination, etc. Experimental results on the CHiME-8 DASR task and the multi-channel track of the NOTSOFAR-1 task demonstrate that MOVER can successfully combine multiple meeting recognition systems with diverse diarization and recognition outputs, achieving relative tcpWER improvements of 9.55 % and…
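The abstract names speaker alignment as one stage of the process but does not describe how it is done. The following is a minimal sketch of one plausible approach, assuming speaker labels are mapped by maximizing total time overlap between two hypotheses. The function names, the segment format, and the brute-force search are all assumptions made for illustration and should not be read as the authors' algorithm.

```python
from itertools import permutations
from typing import Dict, List, Tuple

# Assumed segment format: (speaker_label, start_seconds, end_seconds).
Segment = Tuple[str, float, float]


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def align_speakers(ref: List[Segment], hyp: List[Segment]) -> Dict[str, str]:
    """Map each speaker label in hyp to a label in ref so that the total
    overlapping speech time is maximized (brute force, small speaker counts)."""
    ref_spk = sorted({s for s, _, _ in ref})
    hyp_spk = sorted({s for s, _, _ in hyp})
    if len(hyp_spk) > len(ref_spk):
        raise ValueError("this sketch assumes hyp has no more speakers than ref")

    # Accumulate overlap time for every (hyp speaker, ref speaker) pair.
    total = {(h, r): 0.0 for h in hyp_spk for r in ref_spk}
    for h, hs, he in hyp:
        for r, rs, re_ in ref:
            total[(h, r)] += overlap(hs, he, rs, re_)

    # Try every one-to-one assignment and keep the one with the most overlap.
    best = max(
        permutations(ref_spk, len(hyp_spk)),
        key=lambda p: sum(total[(h, r)] for h, r in zip(hyp_spk, p)),
    )
    return dict(zip(hyp_spk, best))


# Hypothetical outputs of two systems that use different label names.
system_a = [("A", 0.0, 5.0), ("B", 5.0, 10.0)]
system_b = [("spk1", 5.2, 9.8), ("spk2", 0.1, 4.9)]
mapping = align_speakers(system_a, system_b)
print(mapping)
```

The brute-force search over permutations is only practical for a handful of speakers; a linear assignment solver would be the usual substitute for larger speaker counts.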
