HiACC: Hinglish adult & children code-switched corpus

Shruti Singh; Muskaan Singh; Virender Kadyan

PMC · DOI: 10.1016/j.dib.2025.111886 · July 17, 2025


Open Access

TL;DR

The paper introduces HiACC, a new Hinglish code-switched speech corpus of adult and child speakers in India, intended to improve ASR systems.

Contribution

The paper presents the first publicly available code-switched Hinglish speech corpus with recordings from both adults and children.

Findings

01

HiACC includes 3,318 adult and 1,858 children audio segments with detailed annotations.

02

Baseline ASR models show a 42% increase in WER on code-switched speech compared to monolingual input.

03

The corpus is publicly available for research at the provided Zenodo link.

Abstract

Code-switching is the frequent alternation between two or more languages within a single utterance and is a widespread phenomenon among bilingual and multilingual speakers. In India, more than 250 million people are estimated to engage in code-switched communication, especially blending English with Hindi (Hinglish), making it one of the largest bilingual populations globally. This prevalence makes it challenging to develop accurate and robust Automatic Speech Recognition (ASR) systems. Existing ASR models, typically trained on monolingual corpora, struggle with code-switched input due to a lack of large, balanced, and representative datasets, particularly for diverse age groups. Recent evaluations have shown that ASR models experience a relative increase in Word Error Rate (WER) of 30–50% when exposed to code-switched speech compared to monolingual input. To address this resource gap, we introduce a…
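To make the "relative increase in WER" figure concrete, here is a minimal sketch of how WER and its relative increase are commonly computed. It is not the paper's evaluation code: the function names `wer` and `relative_wer_increase`, the whitespace tokenization, and the example sentences and WER values are all assumptions made for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length.

    Assumes simple whitespace tokenization (an illustrative choice).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_increase(wer_monolingual: float, wer_code_switched: float) -> float:
    """Relative (not absolute) increase of code-switched WER over monolingual WER."""
    if wer_monolingual <= 0:
        raise ValueError("monolingual WER must be positive")
    return (wer_code_switched - wer_monolingual) / wer_monolingual


if __name__ == "__main__":
    # Made-up Hinglish utterance and ASR output, for illustration only.
    reference = "mujhe kal office jana hai for a meeting"
    hypothesis = "mujhe kal office jaana hai for meeting"
    print(f"WER: {wer(reference, hypothesis):.2f}")

    # Hypothetical WER values, not results from the paper.
    print(f"Relative increase: {relative_wer_increase(0.20, 0.28):.0%}")
```

Note that a relative increase is measured against the monolingual baseline, so with these hypothetical values a rise from 20% to 28% WER is an 8-point absolute increase but a 40% relative one.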



Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research