Child Speech Recognition in Human-Robot Interaction: Problem Solved?
Ruben Janssens, Eva Verhelst, Giulio Antonio Abbo, Qiaoqiao Ren, Maria Jose Pinto Bernal, Tony Belpaeme

TL;DR
Recent advances in data-driven speech recognition, especially Transformer models like OpenAI Whisper, significantly improve child speech recognition, enabling more effective human-robot interactions despite remaining challenges.
Contribution
This paper demonstrates that modern Transformer-based models substantially enhance child speech recognition performance, showing potential for real-time, autonomous child-robot communication.
Findings
OpenAI Whisper outperforms commercial cloud services.
Structured interactions improve recognition accuracy.
The best model recognises 60.3% of sentences correctly, with sub-second latency.
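One way priming with specific phrases could look in practice is sketched below, using the `initial_prompt` argument of the open-source `openai-whisper` package. This summary does not say how the authors primed their models, so the mechanism, the helper names `build_prompt` and `transcribe_primed`, the model size, and the audio path are all assumptions made for illustration.

```python
from typing import Iterable


def build_prompt(expected_phrases: Iterable[str]) -> str:
    """Join the phrases a child is expected to say into one prompt string.

    Hypothetical helper: empty or whitespace-only entries are dropped.
    """
    cleaned = [p.strip() for p in expected_phrases if p.strip()]
    return ", ".join(cleaned)


def transcribe_primed(audio_path: str,
                      expected_phrases: Iterable[str],
                      model_name: str = "base") -> str:
    """Transcribe one recording, biasing Whisper towards expected phrases.

    Assumes the `openai-whisper` package is installed; it is imported
    lazily so that build_prompt can be used without it.
    """
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(
        audio_path,
        language="en",
        # Text passed here conditions decoding of the first audio window.
        initial_prompt=build_prompt(expected_phrases),
    )
    return result["text"].strip()
```

In a structured interaction such as a quiz, `expected_phrases` would hold the answer options for the current question, for example `transcribe_primed("answer.wav", ["the red ball", "the blue car"])`, where `answer.wav` is a placeholder file name.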
Abstract
Automated Speech Recognition shows superhuman performance for adult English speech on a range of benchmarks, but disappoints when fed children's speech. This has long sat in the way of child-robot interaction. Recent evolutions in data-driven speech recognition, including the availability of Transformer architectures and unprecedented volumes of training data, might mean a breakthrough for child speech recognition and social robot applications aimed at children. We revisit a study on child speech recognition from 2017 and show that indeed performance has increased, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. Performance improves even more in highly structured interactions when priming models with specific phrases. While transcription is not perfect yet, the best model recognises 60.3% of sentences correctly barring small grammatical…
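A sentence-level score like the one quoted above can be computed roughly as in the following sketch. It is a simplification under stated assumptions: it ignores only case and punctuation, whereas the paper's criterion also tolerates small grammatical differences that are not detailed in this summary. The function names `normalize` and `sentence_accuracy` are hypothetical.

```python
import re
from typing import Sequence


def normalize(sentence: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    lowered = sentence.lower()
    no_punct = re.sub(r"[^a-z0-9' ]+", " ", lowered)
    return " ".join(no_punct.split())


def sentence_accuracy(references: Sequence[str],
                      hypotheses: Sequence[str]) -> float:
    """Fraction of hypotheses that match their reference after normalization.

    Assumption: a sentence counts as correct only on an exact match of the
    normalized strings, which is stricter than the paper's criterion.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    if not references:
        return 0.0
    correct = sum(
        normalize(ref) == normalize(hyp)
        for ref, hyp in zip(references, hypotheses)
    )
    return correct / len(references)
```

For example, with references `["The robot is red.", "I like apples"]` and hypotheses `["the robot is red", "I like apple"]`, the first pair matches after normalization and the second does not, so the score is 0.5.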
Taxonomy
Topics: Speech Recognition and Synthesis · Robotics and Automated Systems · Social Robot Interaction and HRI
Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Dense Connections · Label Smoothing
