Enabling automatic transcription of child-centered audio recordings from real-world environments
Daniil Kocharov, Okko Räsänen

TL;DR
This paper introduces a method that automatically identifies the speech segments in noisy, longform child-centered audio recordings that can be transcribed reliably, and transcribes them, enabling scalable linguistic analysis with high accuracy on the selected portions of speech.
Contribution
It presents a novel approach to detect transcribable speech segments in longform audio, significantly improving transcription quality and enabling detailed linguistic analysis of child-centered recordings.
Findings
Median word error rate (WER) of 0% on selected segments
Transcription of 13% of speech with 18% mean WER
High correlation (r=0.92) between automatic and manual word frequencies
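
The two metrics named in these findings can be illustrated with a minimal sketch. It assumes whitespace tokenization, the standard word-level edit-distance definition of WER, and a Pearson correlation over raw word counts; the paper's actual text normalization and frequency scaling are not stated in this summary, and the function names below are hypothetical.

```python
from collections import Counter
from math import sqrt


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def pearson_r(xs, ys) -> float:
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def word_frequency_correlation(manual_texts, automatic_texts) -> float:
    """Correlate word counts from manual and automatic transcripts.

    Assumption: raw counts over the union vocabulary; the paper may
    use a different scaling (e.g. log frequencies).
    """
    manual = Counter(w for text in manual_texts for w in text.split())
    automatic = Counter(w for text in automatic_texts for w in text.split())
    vocab = sorted(set(manual) | set(automatic))
    return pearson_r([manual[w] for w in vocab],
                     [automatic[w] for w in vocab])
```

Under these assumptions, a segment whose automatic transcript matches the manual one exactly has a WER of 0, which is how a median WER of 0% can coexist with a non-zero mean WER: most selected segments are error-free while a minority contain errors.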
Abstract
Longform audio recordings obtained with microphones worn by children (also known as child-centered daylong recordings) have become a standard method for studying children's language experiences and their impact on subsequent language development. Transcripts of longform speech audio would enable rich analyses at various linguistic levels, yet the massive scale of typical longform corpora prohibits comprehensive manual annotation. At the same time, automatic speech recognition (ASR)-based transcription faces significant challenges due to the noisy, unconstrained nature of real-world audio, and no existing study has successfully applied ASR to transcribe such data. However, previous attempts have assumed that ASR must process each longform recording in its entirety. In this work, we present an approach to automatically detect those utterances in longform audio that can be reliably…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
