The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data

Georgios Paraskevopoulos; Chara Tsoukala; Athanasios Katsamanis; Vassilis Katsouros

arXiv:2406.15284·cs.CL·June 24, 2024

Open Access · 1 Repo

TL;DR

This paper shows that a large, weakly supervised corpus of podcast audio, transcribed automatically with Whisper large-v3, can improve automatic speech recognition for low-resource languages such as Greek.

Contribution

It introduces an 800-hour Greek podcast corpus with weak supervision and evaluates its effectiveness in enhancing ASR models for under-resourced languages.

Findings

01 WER improvements with increased data volume

02 Model size correlates with performance gains

03 Weakly supervised data is cost-effective for low-resource ASR

Abstract

The development of speech technologies for languages with limited digital representation poses significant challenges, primarily due to the scarcity of available data. This issue is exacerbated in the era of large, data-intensive models. Recent research has underscored the potential of leveraging weak supervision to augment the pool of available data. In this study, we compile an 800-hour corpus of Modern Greek from podcasts and employ Whisper large-v3 to generate silver transcriptions. This corpus is utilized to fine-tune our models, aiming to assess the efficacy of this approach in enhancing ASR performance. Our analysis spans 16 distinct podcast domains, alongside evaluations on established datasets for Modern Greek. The findings indicate consistent WER improvements, correlating with increases in both data volume and model size. Our study confirms that assembling large, weakly…
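The abstract reports results as word error rate (WER). As a minimal sketch of how that metric is commonly computed, the function below counts word-level substitutions, deletions, and insertions with a Levenshtein alignment and divides by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper, and no text normalization (casing, punctuation, accent stripping) is applied here, although evaluation pipelines usually add some.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (not the authors' evaluation script).
    Words are obtained by whitespace splitting only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: the hypothesis drops one of five reference words.
    reference = "καλημέρα σας και καλώς ήρθατε"
    hypothesis = "καλημέρα και καλώς ήρθατε"
    print(word_error_rate(reference, hypothesis))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions. With silver transcriptions, the "reference" is itself model output, so a score against it measures agreement with the labeling model rather than accuracy against human transcripts.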

Code & Models

Repositories

georgepar/greek_podcasts_asr
Official

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques