Connecting Voices: LoReSpeech as a Low-Resource Speech Parallel Corpus

Samy Ouzerrout

arXiv:2502.18215·cs.CL·March 11, 2026

TL;DR

This paper presents a methodology for building LoReSpeech, a low-resource speech-to-speech translation corpus meant to support NLP technologies for underrepresented languages. The corpus combines short audio clips aligned with their transcriptions and long-form recordings aligned with forced-alignment tools.

Contribution

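The paper introduces a two-stage approach to constructing low-resource speech corpora: a collaborative platform first collects short audio clips aligned with their transcriptions (the LoReASR sub-corpus), and long-form recordings are then aligned on top of that sub-corpus to support multilingual speech technologies.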

Findings

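1. Created the LoReASR sub-corpus of short audio clips aligned with their transcriptions, collected through a collaborative platform.

2. Aligned long-form recordings, such as biblical texts, using tools such as the Montreal Forced Aligner (MFA).

3. Provides intra- and inter-language alignments intended to support multilingual ASR and direct speech-to-speech translation.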

Abstract

Aligned audio corpora are fundamental to NLP technologies such as ASR and speech translation, yet they remain scarce for underrepresented languages, hindering their technological integration. This paper introduces a methodology for constructing LoReSpeech, a low-resource speech-to-speech translation corpus. Our approach begins with LoReASR, a sub-corpus of short audio clips aligned with their transcriptions, created through a collaborative platform. Building on LoReASR, long-form audio recordings, such as biblical texts, are aligned using tools such as the Montreal Forced Aligner (MFA). LoReSpeech delivers both intra- and inter-language alignments, enabling advancements in multilingual ASR systems, direct speech-to-speech translation models, and linguistic preservation efforts, while fostering digital inclusivity. This work is conducted within the Tutlayt AI project (https://tutlayt.fr).
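The distinction between intra- and inter-language alignment can be illustrated with a minimal sketch. The abstract does not describe the corpus format, so everything here is an assumption made for illustration: the `Segment` record, its field names, the `pair_segments` helper, and the idea of joining recordings on a shared segment identifier (for example a verse reference in a biblical text) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One intra-language alignment: an audio span matched to its transcription.

    All field names are hypothetical; the paper's actual schema is not given.
    """
    segment_id: str   # assumed shared key across languages, e.g. a verse reference
    language: str
    audio_path: str
    start_s: float    # span start within the recording, in seconds
    end_s: float      # span end within the recording, in seconds
    text: str


def pair_segments(source, target):
    """Build inter-language pairs by joining two segment lists on segment_id.

    Source segments with no counterpart in the target language are skipped.
    """
    target_by_id = {seg.segment_id: seg for seg in target}
    return [
        (src, target_by_id[src.segment_id])
        for src in source
        if src.segment_id in target_by_id
    ]


# Placeholder data, not taken from the corpus.
lang_a = [
    Segment("GEN.1.1", "lang_a", "a.wav", 0.0, 4.2, "text a1"),
    Segment("GEN.1.2", "lang_a", "a.wav", 4.2, 9.0, "text a2"),
]
lang_b = [
    Segment("GEN.1.1", "lang_b", "b.wav", 0.0, 5.0, "text b1"),
]

pairs = pair_segments(lang_a, lang_b)
```

In this sketch each `Segment` stands for an intra-language alignment (audio to text), while each tuple in `pairs` stands for an inter-language alignment (speech to speech). Joining on a shared identifier is only one plausible design; it fits parallel texts with stable references but would not apply to recordings without such a key.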
