My Science Tutor (MyST) -- A Large Corpus of Children's Conversational Speech
Sameer S. Pradhan, Ronald A. Cole, Wayne H. Ward

TL;DR
The paper introduces the MyST corpus, a large, publicly available collection of children's conversational speech from educational sessions, aimed at advancing speech recognition and conversational AI for educational purposes.
Contribution
The creation and release of one of the largest children's conversational speech corpora, with extensive transcriptions and broad accessibility for research and commercial use.
Findings
Approximately 400 hours of speech data collected
About 100K of the roughly 230K utterances transcribed so far
Ten organizations have licensed the corpus for commercial use, and approximately 40 university and other not-for-profit research groups have downloaded it
Abstract
This article describes the MyST corpus developed as part of the My Science Tutor project -- one of the largest collections of children's conversational speech comprising approximately 400 hours, spanning some 230K utterances across about 10.5K virtual tutor sessions by around 1.3K third, fourth and fifth grade students. 100K of all utterances have been transcribed thus far. The corpus is freely available (https://myst.cemantix.org) for non-commercial use using a creative commons license. It is also available for commercial use (https://boulderlearning.com/resources/myst-corpus/). To date, ten organizations have licensed the corpus for commercial use, and approximately 40 university and other not-for-profit research groups have downloaded the corpus. It is our hope that the corpus can be used to improve automatic speech recognition algorithms, build and evaluate conversational AI agents…
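For a rough sense of scale, the rounded figures in the abstract can be turned into a few back-of-the-envelope ratios. The sketch below is only illustrative: the constant names and the `corpus_ratios` helper are hypothetical, the inputs are the approximate values quoted above, and the seconds-per-utterance figure assumes the full 400 hours is spread evenly over all 230K utterances, which the abstract does not state.

```python
# Rounded figures quoted in the abstract (all approximate).
HOURS = 400
UTTERANCES = 230_000
SESSIONS = 10_500
STUDENTS = 1_300
TRANSCRIBED = 100_000


def corpus_ratios():
    """Return rough derived ratios from the rounded abstract figures."""
    return {
        # Assumes the 400 hours are spread evenly over all utterances.
        "seconds_per_utterance": HOURS * 3600 / UTTERANCES,
        "utterances_per_session": UTTERANCES / SESSIONS,
        "sessions_per_student": SESSIONS / STUDENTS,
        "fraction_transcribed": TRANSCRIBED / UTTERANCES,
    }


if __name__ == "__main__":
    for name, value in corpus_ratios().items():
        print(f"{name}: {value:.2f}")
```

Because every input is rounded, the outputs are order-of-magnitude guides rather than corpus statistics reported by the authors.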
Taxonomy
Topics: Speech and dialogue systems
