QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus

Hamdy Mubarak; Amir Hussein; Shammur Absar Chowdhury; Ahmed Ali

arXiv:2106.13000 · cs.CL · June 25, 2021

TL;DR

QASR is the largest annotated Arabic speech corpus from broadcast media, enabling advancements in speech recognition, dialect identification, and NLP tasks with extensive transcribed speech and auxiliary language data.

Contribution

This paper introduces QASR, the largest multi-dialect Arabic speech dataset with detailed annotations, and provides baseline results for speech recognition and NLP tasks.

Findings

1. End-to-end speech recognition achieves competitive WER.
2. Baseline results for Arabic punctuation restoration.
3. Resource availability for future research.

Abstract

We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16 kHz, crawled from the Aljazeera news channel. The dataset is released with lightly supervised transcriptions aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, punctuation, and speaker information, among others. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to the QASR transcription, we release a dataset of 130M words to aid in designing and training a better language model. We show that end-to-end automatic speech recognition trained on QASR…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing