Multi-Stage Speaker Diarization for Noisy Classrooms
Ali Sartaz Khan, Tolulope Ogunremi, Ahmed Adel Attia, Dorottya Demszky

TL;DR
This paper evaluates multi-stage speaker diarization in noisy classrooms, demonstrating that denoising and hybrid VAD models significantly improve accuracy, with the integration of ASR timestamps further reducing errors in challenging acoustic conditions.
Contribution
The paper introduces a hybrid VAD approach that combines ASR word-level timestamps with frame-wise VAD predictions, and evaluates it on noisy classroom recordings.
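To make the hybrid idea concrete, here is a minimal sketch of one way ASR word timestamps and frame-wise VAD scores could be fused. The summary does not specify the fusion rule, so the union rule used here (a frame counts as speech if either the VAD score passes a threshold or an ASR word covers it) is an assumption, as are the function name `hybrid_vad_segments`, the 20 ms frame shift, and the 0.5 threshold.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def hybrid_vad_segments(
    frame_probs: Sequence[float],
    word_spans: Sequence[Segment],
    frame_shift: float = 0.02,  # assumed frame hop in seconds
    threshold: float = 0.5,     # assumed VAD decision threshold
) -> List[Segment]:
    """Fuse frame-wise VAD scores with ASR word spans (assumed union rule)."""
    n = len(frame_probs)
    # Step 1: frame-level decisions from the VAD scores alone.
    speech = [p >= threshold for p in frame_probs]

    # Step 2: mark every frame overlapped by an ASR word as speech.
    eps = 1e-9  # guards against floating-point error at frame boundaries
    for start, end in word_spans:
        first = max(0, math.floor(start / frame_shift + eps))
        last = min(n, math.ceil(end / frame_shift - eps))
        for i in range(first, last):
            speech[i] = True

    # Step 3: merge consecutive speech frames into (start, end) segments.
    segments: List[Segment] = []
    seg_start = None
    for i, is_speech in enumerate(speech):
        if is_speech and seg_start is None:
            seg_start = i
        elif not is_speech and seg_start is not None:
            segments.append((round(seg_start * frame_shift, 3),
                             round(i * frame_shift, 3)))
            seg_start = None
    if seg_start is not None:
        segments.append((round(seg_start * frame_shift, 3),
                         round(n * frame_shift, 3)))
    return segments


if __name__ == "__main__":
    # Hypothetical inputs: ten 20 ms frames and one ASR word the VAD missed.
    probs = [0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8, 0.8]
    words = [(0.10, 0.16)]
    print(hybrid_vad_segments(probs, words))
```

In this toy input the ASR word fills a stretch the frame-wise VAD scored as non-speech, so the second segment extends backward to cover it. A union rule of this kind can only reduce missed speech relative to the frame-wise VAD, at the cost of possibly adding false alarms when the ASR timestamps are wrong.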
Findings
Denoising reduces missed speech and improves DER.
Training on combined denoised and noisy data enhances robustness.
Hybrid VAD achieves DER as low as 17% in teacher-student separation.
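For readers unfamiliar with the metric, diarization error rate (DER) is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below computes it from those durations; the function name and the numbers in the usage example are hypothetical and are not results from the paper.

```python
def diarization_error_rate(
    missed: float,
    false_alarm: float,
    confusion: float,
    total_reference: float,
) -> float:
    """Standard DER: (missed + false alarm + confusion) / reference speech.

    All arguments are durations in seconds.
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (missed + false_alarm + confusion) / total_reference


if __name__ == "__main__":
    # Hypothetical durations, not taken from the paper.
    der = diarization_error_rate(
        missed=6.0, false_alarm=2.0, confusion=4.0, total_reference=60.0
    )
    print(f"DER = {der:.1%}")
```

Because missed speech is one of the three numerator terms, any step that recovers speech the detector would otherwise drop lowers DER directly, provided it does not add a larger amount of false-alarm time.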
Abstract
Speaker diarization, the process of identifying "who spoke when" in audio recordings, is essential for understanding classroom dynamics. However, classroom settings present distinct challenges, including poor recording quality, high levels of background noise, overlapping speech, and the difficulty of accurately capturing children's voices. This study investigates the effectiveness of multi-stage diarization models using NVIDIA's NeMo diarization pipeline. We assess the impact of denoising on diarization accuracy and compare various voice activity detection (VAD) models, including self-supervised transformer-based frame-wise VAD models. We also explore a hybrid VAD approach that integrates Automatic Speech Recognition (ASR) word-level timestamps with frame-level VAD predictions. We conduct experiments using two datasets from English-speaking classrooms to separate teacher vs. student…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
