TL;DR
This paper presents a neural network-based method for detecting sample drops in multi-device, multi-microphone speech recordings, improving synchronization and data quality in distributed distant-speech recognition systems.
Contribution
It introduces a CNN-LSTM model with multi-head attention for sample drop detection, validated on real and artificial multi-channel speech data, achieving high F1 scores.
Findings
Achieved an 88% F1 score on the CHiME-5 corpus.
Effective detection of sample drops in distributed microphone arrays.
Robust performance on artificial multi-channel data.
Abstract
In many applications of multi-microphone multi-device processing, the synchronization among different input channels can be affected by the lack of a common clock and by isolated drops of samples. In this work, we address the issue of sample drop detection in the context of a conversational speech scenario, recorded by a set of microphones distributed in space. The goal is to design a neural-based model that, given a short window in the time domain, detects whether one or more devices have been subjected to a sample drop event. The candidate time windows are selected, by means of a preprocessing step, from a set of large time intervals that possibly include a sample drop. This preprocessing step is based on the application of normalized cross-correlation between signals acquired by different devices. The architecture of the neural network relies on a CNN-LSTM encoder, followed by multi-head attention. The…
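
The preprocessing idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (normalized_xcorr, best_lag), the window length, the lag search range, and the synthetic white-noise signal are all assumptions made for illustration. The sketch estimates the lag that maximizes the normalized cross-correlation between two devices in a window before and a window after a simulated drop; a change in that lag marks the interval in between as a candidate for closer inspection.

```python
import math
import random


def normalized_xcorr(x, y, lag):
    """Normalized cross-correlation between x[n] and y[n + lag] over their overlap."""
    if lag >= 0:
        a, b = x[: len(x) - lag], y[lag:]
    else:
        a, b = x[-lag:], y[: len(y) + lag]
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    den = math.sqrt(
        sum((p - mean_a) ** 2 for p in a) * sum((q - mean_b) ** 2 for q in b)
    )
    return num / den if den > 0 else 0.0


def best_lag(x, y, max_lag):
    """Return the lag in [-max_lag, max_lag] with the highest normalized correlation."""
    return max(range(-max_lag, max_lag + 1), key=lambda lag: normalized_xcorr(x, y, lag))


# Synthetic example (assumption): white noise stands in for speech, and the
# second device loses DROP consecutive samples starting at index 1000.
random.seed(0)
DROP = 5
reference = [random.gauss(0.0, 1.0) for _ in range(2000)]
dropped = reference[:1000] + reference[1000 + DROP:]

# Compare the estimated lag in a window before and a window after the drop.
lag_before = best_lag(reference[200:600], dropped[200:600], max_lag=20)
lag_after = best_lag(reference[1200:1600], dropped[1200:1600], max_lag=20)

# A change in the estimated lag suggests a drop somewhere between the windows.
drop_suspected = lag_after != lag_before
```

In this toy setup the lag shifts by the number of dropped samples. Real recordings also exhibit clock drift and different propagation delays per device, so a lag change alone would only nominate candidate windows, which is the role the abstract assigns to the preprocessing step before the neural detector.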
