TICL+: A Case Study On Speech In-Context Learning for Children's Speech Recognition
Haolong Zheng, Yekaterina Yegorova, Mark Hasegawa-Johnson

TL;DR
This paper introduces TICL+, an enhanced speech in-context learning method for children's speech recognition that combines semantic and acoustic example selection, reducing word error rate by up to 53.3% relative to zero-shot performance and by up to 37.6% relative to baseline TICL.
Contribution
The paper proposes TICL+, a novel extension of TICL that incorporates acoustic reranking, improving example selection for better children's speech recognition without fine-tuning.
Findings
TICL+ reduces word error rate by up to 53.3% relative to zero-shot.
TICL+ achieves up to a 37.6% relative word error rate reduction over baseline TICL.
Combining semantic and acoustic information enhances ASR robustness.
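The percentages above are relative reductions, not absolute differences in word error rate. A minimal sketch of that arithmetic follows; the function name and the input values are hypothetical and chosen only for illustration, and they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction of a system over a baseline, in percent.

    Both inputs are word error rates on the same scale (for example, percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical WERs for illustration only (not taken from the paper):
# a baseline at 20.0% WER and a system at 10.0% WER.
reduction = relative_wer_reduction(baseline_wer=20.0, system_wer=10.0)
print(f"{reduction:.1f}% relative WER reduction")
```

Under this definition, the same absolute gain yields a larger relative reduction when the baseline WER is lower, which is why the baseline (zero-shot or TICL) has to be named alongside each figure.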
Abstract
Children's speech recognition remains challenging due to substantial acoustic and linguistic variability, limited labeled data, and significant differences from adult speech. Speech foundation models can address these challenges through Speech In-Context Learning (SICL), allowing adaptation to new domains without fine-tuning. However, the effectiveness of SICL depends on how in-context examples are selected. We extend an existing retrieval-based method, Text-Embedding KNN for SICL (TICL), introducing an acoustic reranking step to create TICL+. This extension prioritizes examples that are both semantically and acoustically aligned with the test input. Experiments on four children's speech corpora show that TICL+ achieves up to a 53.3% relative word error rate reduction over zero-shot performance and 37.6% over baseline TICL, highlighting the value of combining semantic and acoustic…
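The retrieve-then-rerank idea described in the abstract can be sketched as a two-stage selection: a KNN search over text embeddings, followed by a reordering of the retrieved candidates by acoustic similarity. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the use of cosine similarity in both stages, the parameters `k_retrieve` and `k_final`, and the toy two-dimensional embeddings are all hypothetical. How the test utterance's text embedding is obtained is outside the scope of the sketch.

```python
import math
from typing import List, Sequence, Tuple

# Each pool entry: (example_id, text_embedding, audio_embedding).
Example = Tuple[str, Sequence[float], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_examples(
    query_text_emb: Sequence[float],
    query_audio_emb: Sequence[float],
    pool: List[Example],
    k_retrieve: int,
    k_final: int,
) -> List[str]:
    """Pick in-context examples: text-embedding KNN, then acoustic reranking."""
    # Stage 1 (assumed): keep the k_retrieve examples closest in text-embedding space.
    by_text = sorted(pool, key=lambda ex: cosine(query_text_emb, ex[1]), reverse=True)
    candidates = by_text[:k_retrieve]
    # Stage 2 (assumed): reorder those candidates by acoustic similarity to the query.
    by_audio = sorted(candidates, key=lambda ex: cosine(query_audio_emb, ex[2]), reverse=True)
    return [ex[0] for ex in by_audio[:k_final]]


# Toy pool with made-up embeddings, for illustration only.
pool = [
    ("a", [1.0, 0.0], [0.0, 1.0]),  # semantically closest, acoustically far
    ("b", [0.9, 0.1], [1.0, 0.0]),  # semantically close, acoustically closest
    ("c", [0.0, 1.0], [1.0, 0.0]),  # acoustically close, semantically far
]
selected = select_examples([1.0, 0.0], [1.0, 0.0], pool, k_retrieve=2, k_final=1)
print(selected)
```

In this toy pool, example "c" is dropped in the first stage despite its acoustic match, and "b" is preferred over "a" in the second stage, so the final choice is constrained by semantic similarity and ordered by acoustic similarity.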
Taxonomy
Topics: Speech Recognition and Synthesis · Language Development and Disorders · Speech and Audio Processing
