Multi-Scale Temporal Convolution Network for Classroom Voice Detection
Lu Ma, Xintian Wang, Song Yang, Yaguang Gong, Zhongqin Wu

TL;DR
This paper introduces a multi-scale temporal convolution network for classifying classroom voice signals into four categories, improving the extraction of assistant teacher voices amidst interference for better downstream speech processing.
Contribution
It proposes a novel multi-scale temporal convolution neural network with dilated convolutions for frame-level sound event detection in classroom environments, enhancing voice classification accuracy.
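To make the idea of multi-scale dilated temporal convolution concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a single scalar feature per frame, one shared kernel, a hypothetical set of dilation rates, and a hypothetical linear scoring layer over four unnamed classes; the summary above does not give the paper's actual kernel sizes, dilation rates, layer counts, or class names.

```python
from typing import List


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of frames spanned by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Zero-padded ('same' length) 1D dilated convolution over a frame sequence."""
    center = (len(kernel) - 1) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - center) * dilation
            if 0 <= idx < len(x):  # frames outside the sequence count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def multi_scale_features(
    x: List[float], kernel: List[float], dilations: List[int]
) -> List[List[float]]:
    """Run one branch per dilation rate and concatenate the results per frame."""
    branches = [dilated_conv1d(x, kernel, d) for d in dilations]
    return [[branch[t] for branch in branches] for t in range(len(x))]


def classify_frames(
    features: List[List[float]], class_weights: List[List[float]]
) -> List[int]:
    """Frame-level decision: linear score per class, then argmax.

    class_weights is a hypothetical (num_classes x num_branches) matrix;
    a trained network would learn these values.
    """
    labels = []
    for f in features:
        scores = [sum(w * v for w, v in zip(row, f)) for row in class_weights]
        labels.append(scores.index(max(scores)))
    return labels


if __name__ == "__main__":
    frames = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # toy per-frame feature
    feats = multi_scale_features(frames, [1.0, 1.0, 1.0], dilations=[1, 2])
    weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # four classes
    print(classify_frames(feats, weights))
```

Each branch sees a different temporal context for the same kernel size (a dilation of 2 with a 3-tap kernel spans 5 frames), which is the sense in which the features are multi-scale; every frame still receives its own label, matching the frame-level detection setting described above.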
Findings
High precision and recall on simulated data
Effective in real-world classroom recordings
Outperforms classical classification methods
Abstract
Teaching through the cooperation of an expert teacher and an assistant teacher, the so-called "double-teachers classroom", is becoming more prevalent in K-12 education: the course is given by the expert online and presented on a projection screen in the classroom, while the teacher in the classroom acts as an assistant who guides the students' learning. To monitor teaching quality, a microphone clipped to the assistant's neckline is always used for voice recording, and the recording is then fed to the downstream tasks of automatic speech recognition (ASR) and natural language processing (NLP). However, besides the assistant's voice, the recording contains other interfering voices, including those of the expert and the students. Here, we propose to extract the assistant's voice from the perspective of sound event detection, i.e., the voices are classified into four…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hand Gesture Recognition Systems
