Multi-Stage Speaker Diarization for Noisy Classrooms

Ali Sartaz Khan; Tolulope Ogunremi; Ahmed Adel Attia; Dorottya Demszky

arXiv:2505.10879·cs.SD·May 28, 2025

TL;DR

This paper evaluates multi-stage speaker diarization in noisy classrooms, demonstrating that denoising and hybrid VAD models significantly improve accuracy, with the integration of ASR timestamps further reducing errors in challenging acoustic conditions.

Contribution

The paper introduces a hybrid voice activity detection (VAD) approach that combines ASR word-level timestamps with frame-wise VAD predictions, and assesses its effectiveness in noisy classroom environments.
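The listing does not say how the two signals are fused, so the following is only a minimal sketch of one plausible combination, not the authors' method: a frame counts as speech if the frame-wise VAD probability clears a threshold or if the frame falls inside an ASR word span. The function names, the 0.5 threshold, and the 20 ms frame shift are all assumptions made for illustration.

```python
import math


def hybrid_vad(frame_probs, word_spans, frame_shift=0.02, threshold=0.5):
    """Fuse frame-wise VAD scores with ASR word timestamps (illustrative only).

    frame_probs: per-frame speech probabilities from any frame-wise VAD.
    word_spans:  (start_sec, end_sec) pairs from ASR word-level timestamps.
    The OR-style fusion, threshold, and frame_shift are assumptions.
    """
    # Start from the thresholded frame-wise VAD decision.
    mask = [p >= threshold for p in frame_probs]
    n = len(mask)
    eps = 1e-9  # guards against floating-point error at frame boundaries
    for start, end in word_spans:
        lo = max(0, math.floor(start / frame_shift + eps))
        hi = min(n, math.ceil(end / frame_shift - eps))
        # Mark every frame covered by an ASR word as speech.
        for i in range(lo, hi):
            mask[i] = True
    return mask


def mask_to_segments(mask, frame_shift=0.02):
    """Collapse a boolean frame mask into (start_sec, end_sec) segments."""
    segments = []
    start = None
    for i, is_speech in enumerate(mask):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:
        segments.append((start * frame_shift, len(mask) * frame_shift))
    return segments
```

With an OR-style fusion like this, ASR timestamps can only add speech frames that the frame-wise VAD missed. That tends to trade fewer missed-speech errors for a risk of more false alarms when the ASR output contains spurious words.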

Findings

01

Denoising reduces missed speech and lowers the diarization error rate (DER).

02

Training on combined denoised and noisy data enhances robustness.

03

Hybrid VAD achieves DER as low as 17% in teacher-student separation.
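For reference, DER as it is commonly defined is the fraction of reference speech time that is attributed incorrectly. The symbols below are notation introduced here, and the exact scoring settings used in the paper (such as the forgiveness collar or how overlapping speech is handled) are not stated on this page.

```latex
\mathrm{DER} = \frac{T_{\mathrm{FA}} + T_{\mathrm{miss}} + T_{\mathrm{conf}}}{T_{\mathrm{total}}}
```

Here $T_{\mathrm{FA}}$ is the duration of non-speech labeled as speech (false alarm), $T_{\mathrm{miss}}$ is the duration of speech that was not detected, $T_{\mathrm{conf}}$ is the duration of speech assigned to the wrong speaker, and $T_{\mathrm{total}}$ is the total duration of reference speech. Under this definition, a reduction in missed speech lowers DER through the $T_{\mathrm{miss}}$ term, provided the other terms do not grow by more.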

Abstract

Speaker diarization, the process of identifying "who spoke when" in audio recordings, is essential for understanding classroom dynamics. However, classroom settings present distinct challenges, including poor recording quality, high levels of background noise, overlapping speech, and the difficulty of accurately capturing children's voices. This study investigates the effectiveness of multi-stage diarization models using NVIDIA's NeMo diarization pipeline. We assess the impact of denoising on diarization accuracy and compare various voice activity detection (VAD) models, including self-supervised transformer-based frame-wise VAD models. We also explore a hybrid VAD approach that integrates Automatic Speech Recognition (ASR) word-level timestamps with frame-level VAD predictions. We conduct experiments using two datasets from English-speaking classrooms to separate teacher vs. student…

Code & Models

Repositories

edunlp/nemo-multistage-classroom-diarization
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition