SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition

Muhammad Umar Farooq; Oscar Saz

arXiv:2506.22143 · cs.CL · June 30, 2025

TL;DR

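This paper introduces SAGE, a data augmentation method that splices audio to generate artificial Arabic-English code-switched (CS) speech for low-resource speech recognition. Fine-tuning on SAGE data yields a 7.8% absolute WER improvement on CS benchmarks, and the final WER of 31.1% on Arabic-English CS benchmarks surpasses large-scale multilingual models.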

Contribution

The paper presents a novel spliced-audio data generation technique and an experience replay approach to enhance speech recognition in low-resource, dialectal, and code-switched Arabic-English contexts.
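The page does not describe how the paper's modified splicing works, so the following is only a generic sketch of the underlying idea: word-level audio segments from two languages are concatenated, with a short crossfade at each join, to form one artificial code-switched utterance and its transcript. The segment banks, the strict Arabic/English alternation, the linear crossfade, and the names `crossfade_splice` and `make_code_switched` are all assumptions made for illustration, not the authors' method.

```python
import random


def crossfade_splice(segments, overlap):
    """Concatenate sample lists, linearly crossfading `overlap` samples at each join."""
    out = list(segments[0])
    for seg in segments[1:]:
        # Never fade over more samples than either side actually has.
        n = min(overlap, len(out), len(seg))
        base = len(out) - n
        for k in range(n):
            w = (k + 1) / (n + 1)  # weight of the incoming segment
            out[base + k] = (1 - w) * out[base + k] + w * seg[k]
        out.extend(seg[n:])
    return out


def make_code_switched(arabic_bank, english_bank, n_segments, rng, overlap=4):
    """Build one artificial utterance by alternating languages (an assumption).

    Each bank is a list of (word, samples) pairs, e.g. cut from aligned
    monolingual recordings. Returns (samples, transcript_words).
    """
    words, segments = [], []
    for i in range(n_segments):
        bank = arabic_bank if i % 2 == 0 else english_bank
        word, samples = rng.choice(bank)
        words.append(word)
        segments.append(samples)
    return crossfade_splice(segments, overlap), words
```

In a real pipeline the sample lists would come from forced-aligned recordings, and the switching pattern would likely be sampled rather than strictly alternating; both are outside what this page states.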

Findings

01. 7.8% absolute WER improvement on Arabic and English CS benchmarks with SAGE data

02. Outperforms larger multilingual models on CS benchmarks

03. Reduces overall mean WER from 31.7% to 26.6% by integrating an out-of-domain 3-gram language model
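Because the findings mix absolute WER figures, a short sketch of the arithmetic may help. WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. Going from 31.7% to 26.6% is a reduction of 5.1 percentage points absolute, or about 16% relative. The function names below are illustrative, and the code is a minimal reference computation, not the scoring tool used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(before, after):
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = before - after
    return absolute, 100.0 * absolute / before


abs_points, rel_percent = wer_reduction(31.7, 26.6)
print(f"{abs_points:.1f} points absolute, {rel_percent:.1f}% relative")
```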

Abstract

This paper investigates the performance of various self-supervised learning (SSL) speech models on dialectal Arabic (DA) and Arabic-English code-switched (CS) speech. To address data scarcity, a modified audio-splicing approach is introduced to generate artificial CS speech data. Further fine-tuning an already fine-tuned SSL model on the proposed Spliced-Audio Generated (SAGE) data yields an absolute improvement in Word Error Rate (WER) of 7.8% on Arabic and English CS benchmarks. Additionally, an approach inspired by Experience Replay (ER) is proposed to enhance generalisation across DA and CS speech while mitigating catastrophic forgetting. Integrating an out-of-domain 3-gram language model reduces the overall mean WER from 31.7% to 26.6%. Few-shot fine-tuning on code-switching benchmarks further improves WER by 4.9%. A WER of 31.1% on Arabic-English CS benchmarks surpasses large-scale multilingual models,…
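The abstract does not spell out the ER-inspired procedure. As a rough illustration of the general experience replay idea it draws on, the sketch below mixes a fixed fraction of examples from an earlier training stage into every batch of new data, so the model keeps seeing old material while learning the new. The function name `replay_batches`, the `replay_fraction` parameter, and the uniform sampling from the buffer are assumptions, not details from the paper.

```python
import random


def replay_batches(new_data, replay_buffer, batch_size, replay_fraction, rng):
    """Yield batches combining new examples with replayed earlier examples."""
    n_replay = int(round(batch_size * replay_fraction))
    n_new = batch_size - n_replay
    if n_new <= 0:
        raise ValueError("replay_fraction leaves no room for new examples")
    new = list(new_data)
    rng.shuffle(new)
    for start in range(0, len(new), n_new):
        batch = new[start:start + n_new]
        # Replay: draw earlier-stage examples without replacement per batch.
        batch += rng.sample(replay_buffer, min(n_replay, len(replay_buffer)))
        rng.shuffle(batch)
        yield batch


# Hypothetical usage: 12 new CS examples, 5 earlier DA examples, 25% replay.
rng = random.Random(0)
cs_examples = [("cs", i) for i in range(12)]
da_examples = [("da", i) for i in range(5)]
for batch in replay_batches(cs_examples, da_examples, 4, 0.25, rng):
    print(batch)
```

With a batch size of 4 and a replay fraction of 0.25, each batch holds three new examples and one replayed example; raising the fraction trades adaptation speed for retention.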

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research