RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks
Alexandra Diaconu, Mădălina Vînaga, Bogdan Alexe

TL;DR
RO-N3WS is a diverse Romanian speech dataset that enhances the generalization of low-resource ASR systems, demonstrating significant improvements through fine-tuning and synthetic data augmentation.
Contribution
The paper introduces RO-N3WS, a new benchmark dataset for Romanian speech recognition, and evaluates its effectiveness in improving ASR performance in low-resource and diverse domains.
Findings
Fine-tuning on RO-N3WS significantly reduces WER.
Synthetic TTS data aids in model robustness.
State-of-the-art models benefit from the dataset in low-resource settings.
Abstract
We introduce RO-N3WS, a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR), particularly in low-resource and out-of-distribution (OOD) conditions. RO-N3WS comprises over 126 hours of transcribed audio collected from broadcast news, literary audiobooks, film dialogue, children's stories, and conversational podcast speech. This diversity enables robust training and fine-tuning across stylistically distinct domains. We evaluate several state-of-the-art ASR systems (Whisper, Wav2Vec 2.0) in both zero-shot and fine-tuned settings, and conduct controlled comparisons using synthetic data generated with expressive TTS models. Our results show that even limited fine-tuning on real speech from RO-N3WS yields substantial WER improvements over zero-shot baselines. We will release all models, scripts, and data splits to support reproducible…
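The abstract reports its results as word error rate (WER), so a minimal sketch of how WER is commonly computed may be useful here. The function names `normalize` and `word_error_rate`, and the normalization step (lowercasing and stripping punctuation), are illustrative assumptions and not the authors' scoring script.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words.

    Assumption: this simple rule set is for illustration only; a real
    scoring pipeline may treat numbers, hyphens, and casing differently.
    """
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return cleaned.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Comparing a zero-shot and a fine-tuned model then amounts to averaging this score (or summing edit counts and reference lengths) over the same test utterances for each system.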
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
The strengths of this paper are: 1. Novelty: A benchmark‑ready Romanian corpus explicitly structured for in‑domain vs. OOD evaluation. 2. Solid experimental design: multiple model families (wav2vec 2.0, Whisper), model sizes, a Romanian‑specific baseline, and commercial APIs; per‑domain WER reporting (ProTV vs. Antena 1; four OOD types). Includes multi‑run variability for selected fine‑tuning setups and detailed hyperparameters. The scoring discussion (formatting mismatches, numbers) shows awar
The weaknesses of the paper are: 1. Data Rights, Licensing, and Ethics Not Fully Specified: Broadcast news and film audio are likely copyright‑protected; the paper does not clearly state the legal basis for redistribution (e.g., licenses, permissions, time‑bounded usage, “research only,” or derivative transcription rights). OOD content sourced from YouTube (films, stories, podcasts) may have rights holders and terms of service implications. In addition, the paper should clarify PII handling an
This is a well-written paper, and the dataset and code will be useful to researchers in the area of automatic speech recognition. The exploration of robustness to out-of-distribution scenarios is good to see. The significant annotation effort and attention to data quality reported will increase the utility of the dataset.
Issues related to dataset permissions have not been addressed, and copyright issues related to broadcast news should be discussed. The motivation for focusing on broadcast news should be elaborated. What use cases are envisioned for systems trained using this dataset?
* New dataset for Romanian speech. * Quantitative analysis of the dataset (e.g., NER density, prosodic features), showing it is more lexically and stylistically diverse than existing corpora. * Solid benchmark, including open-source and commercial models. * Data, models, and scripts are to be released.
* Critically Narrow Scope (specifically for ICLR): The entire contribution is scoped to a single, low-resource language (Romanian). It might be a better fit for a speech recognition conference. * YODAS not mentioned / compared. * The TTS experiments should be done differently (see comments below). * Fine-tuning can probably be improved (see comments below). * Some smaller aspects are not clear (see questions below).
1. The motivation behind the paper is good, targeting to improve robustness for Romanian, a relatively low-resource language. 2. The dataset analysis provides critical insights for Romanian audio features.
1. The contribution is minimal, as the work is largely confined to standard data acquisition such as web crawling, annotating, and cleaning, and is only targeted at one language. 2. The paper does not adequately discuss or compare with other related work that is potentially heavily relevant (see Questions). 3. The dataset size of 126 hours remains relatively small, even for a low-resource language.
- Dataset contribution. RO-N3WS fills a clear gap in the Romanian ASR landscape, offering diverse, well-curated data that spans broadcast, expressive, and conversational speech, which are missing in prior corpora. - Benchmarks cover both open-source (Whisper, Wav2Vec 2.0) and commercial systems under both zero-shot and fine-tuned settings. The experiments are thorough. - Good to see realistic generalization testing. The OOD sets (films, podcasts, audiobooks) provide a robust and practical way to test adapta
- The investigation taken in this paper is mainly empirical, using well-known techniques, with minor contributions in models and methods. - Limited linguistic diversity. Although RO-N3WS is valuable, it is monolingual (Romanian). - The data size of 126 hours is modest.
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
