ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development

Yanir Marmor; Kinneret Misgav; Yair Lifshitz

arXiv:2307.08720 · eess.AS · July 19, 2023


TL;DR

The paper introduces ivrit.ai, a large, accessible Hebrew speech dataset with diverse speakers and formats, designed to advance Hebrew ASR technology and AI applications.

Contribution

It provides an extensive Hebrew speech dataset that is legally accessible at no cost and is released in three forms (raw audio, audio after Voice Activity Detection, and partially transcribed data) to support varied research and development needs.

Findings

01. Over 3,300 hours of speech collected

02. Over a thousand speakers represented

03. Available in three forms: raw audio, audio after Voice Activity Detection, and partially transcribed data

Abstract

We introduce "ivrit.ai", a comprehensive Hebrew speech dataset, addressing the distinct lack of extensive, high-quality resources for advancing Automated Speech Recognition (ASR) technology in Hebrew. With over 3,300 speech hours and over a thousand diverse speakers, ivrit.ai offers a substantial compilation of Hebrew speech across various contexts. It is delivered in three forms to cater to varying research needs: raw unprocessed audio, data post-Voice Activity Detection, and partially transcribed data. The dataset stands out for its legal accessibility, permitting use at no cost, thereby serving as a crucial resource for researchers, developers, and commercial entities. ivrit.ai opens up numerous applications, offering vast potential to enhance AI capabilities in Hebrew. Future efforts aim to expand ivrit.ai further, thereby advancing Hebrew's standing in AI research and technology.
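To make the second delivery form concrete, here is a minimal sketch of what a Voice Activity Detection pass does: it marks the time spans of a recording that contain speech-like energy. This is an illustration only. The abstract does not say which VAD method the authors used, and the function names, the 30 ms frame length, and the RMS threshold below are all assumptions chosen for the example.

```python
import math


def frame_energies(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame of frame_len samples."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def detect_speech_segments(samples, sample_rate, frame_ms=30, threshold=0.1):
    """Return (start_sec, end_sec) pairs for runs of frames whose RMS >= threshold.

    Hypothetical energy-threshold VAD; not the method used for ivrit.ai.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_energies(samples, frame_len)
    segments = []
    run_start = None  # index of the first frame in the current active run
    for i, energy in enumerate(energies):
        if energy >= threshold and run_start is None:
            run_start = i
        elif energy < threshold and run_start is not None:
            segments.append((run_start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            run_start = None
    if run_start is not None:
        # The recording ended while a run was still active.
        segments.append((run_start * frame_len / sample_rate,
                         len(energies) * frame_len / sample_rate))
    return segments


# Synthetic input: 0.3 s of silence, 0.6 s of a 100 Hz tone, 0.3 s of silence.
sample_rate = 1000
silence = [0.0] * 300
tone = [0.5 * math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(600)]
signal = silence + tone + silence

segments = detect_speech_segments(signal, sample_rate)
print(segments)
```

A fixed energy threshold is the simplest possible choice and is fragile on noisy recordings; production pipelines typically use trained VAD models. The point here is only the shape of the output: a list of time spans that can be cut out of the raw audio.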

Code & Models

Models

sivan22/faster-whisper-ivrit-ai-whisper-large-v2-tuned (Hugging Face model · 9 downloads · 2 likes)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling