TL;DR
This paper introduces a self-contained TTS-STT flywheel system that synthesizes entity-dense Indic speech to significantly improve ASR performance on niche Indic domains, surpassing existing open-source and commercial systems.
Contribution
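It presents an open-source Indic TTS pipeline combined with a LoRA fine-tune of vasista22/whisper-telugu-large-v2 to close the ASR gap in niche domains, reaching an Entity-Hit-Rate (EHR) 17x that of the open-source SOTA and 3x that of the commercial baseline (Deepgram Nova-3) on the held-out Telugu test set.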
Findings
Achieved EHR of 0.473 on the held-out Telugu test set, 17x the open SOTA (0.027) and about 3x Deepgram Nova-3 (0.16).
Synthesized ~22,000 entity-dense Indic-English code-mix utterances at <$50 marginal cost.
Native human recordings confirm transfer to real speech.
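This summary does not define how Entity-Hit-Rate is computed, so the following is only a minimal sketch under one plausible assumption: EHR is the fraction of reference entities that appear verbatim (after case and whitespace normalization) in the ASR hypothesis. The function name `entity_hit_rate`, the matching rule, and the toy utterances are all hypothetical and are not taken from the paper. The last lines re-derive the reported multipliers from the EHR values quoted above.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization, not the paper's)."""
    return " ".join(text.lower().split())


def entity_hit_rate(reference_entities, hypotheses):
    """Fraction of reference entities found verbatim in the matching hypothesis.

    reference_entities: list of lists of entity strings, one list per utterance.
    hypotheses: list of ASR transcripts, aligned with reference_entities.
    """
    hits = 0
    total = 0
    for entities, hyp in zip(reference_entities, hypotheses):
        norm_hyp = normalize(hyp)
        for entity in entities:
            total += 1
            if normalize(entity) in norm_hyp:
                hits += 1
    return hits / total if total else 0.0


# Toy data (hypothetical): two utterances with three entities in total.
refs = [["500 rupees", "Hyderabad"], ["9876543210"]]
hyps = ["send 500 rupees to hyderabad", "call 98765 43210"]
toy_ehr = entity_hit_rate(refs, hyps)  # the split digit string counts as a miss

# Multipliers implied by the EHR values quoted in this summary.
ratio_vs_open = 0.473 / 0.027        # fine-tuned model vs vasista22
ratio_vs_commercial = 0.473 / 0.16   # fine-tuned model vs Deepgram Nova-3
print(round(toy_ehr, 3), round(ratio_vs_open, 1), round(ratio_vs_commercial, 1))
```

Under this strict matching rule a digit string split by a space is scored as a miss, which illustrates why entity-focused metrics can be much harsher than WER on digit strings and amounts. The paper's actual metric may normalize differently.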
Abstract
Niche-domain Indic ASR -- digit strings, currency amounts, addresses, brand names, English/Indic codemix -- is under-served by both open-source SOTA and commercial systems. On a synthesised entity-dense Telugu test set (held-out by synthesis system), vasista22/whisper-telugu-large-v2 (open SOTA) achieves Entity-Hit-Rate (EHR) 0.027 and Deepgram Nova-3 (commercial) 0.16. We close this gap with a self-contained TTS<->STT flywheel: an open-source Indic TTS pipeline synthesises ~22,000 entity-dense Indic-English code-mix utterances at <$50 marginal cost, and a LoRA fine-tune on top of vasista22 achieves EHR 0.473 on the held-out test (17x over open SOTA, 3x over commercial), with read-prose regression bounded to +6.6 pp WER on FLEURS-Te. Cross-language: beta-Hi 0.337 (7x vs vasista22) and beta-Ta 0.543 (22x vs vasista22, 22x vs Deepgram); on Hindi where Deepgram has substantial entity…
