ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development
Yanir Marmor, Kinneret Misgav, Yair Lifshitz

TL;DR
The paper introduces ivrit.ai, a large, accessible Hebrew speech dataset with diverse speakers and formats, designed to advance Hebrew ASR technology and AI applications.
Contribution
It provides the first extensive, legally accessible Hebrew speech dataset with multiple data formats, supporting diverse research and development needs.
Findings
Over 3,300 hours of speech data collected
Contains data from over a thousand speakers
Available in raw, processed, and transcribed forms
Abstract
We introduce "ivrit.ai", a comprehensive Hebrew speech dataset, addressing the distinct lack of extensive, high-quality resources for advancing Automated Speech Recognition (ASR) technology in Hebrew. With over 3,300 speech hours and over a thousand diverse speakers, ivrit.ai offers a substantial compilation of Hebrew speech across various contexts. It is delivered in three forms to cater to varying research needs: raw unprocessed audio, data post-Voice Activity Detection (VAD), and partially transcribed data. The dataset stands out for its legal accessibility, permitting use at no cost, thereby serving as a crucial resource for researchers, developers, and commercial entities. ivrit.ai opens up numerous applications, offering vast potential to enhance AI capabilities in Hebrew. Future efforts aim to expand ivrit.ai further, thereby advancing Hebrew's standing in AI research and technology.
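To make the "post-Voice Activity Detection" form more concrete, here is a minimal sketch of what a VAD step does: it marks the stretches of audio that contain sound and discards the rest. This is a naive energy-threshold illustration, not the method used to build ivrit.ai, which this summary does not specify. The function names, the frame length, and the threshold are all hypothetical choices made for the example.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame (hypothetical helper)."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def detect_speech(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample-index pairs for runs of frames above the threshold.

    frame_len and threshold are illustrative defaults, not values from the paper.
    """
    segments = []
    start = None
    energies = frame_energies(samples, frame_len)
    for idx, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = idx * frame_len            # a loud run begins here
        elif energy < threshold and start is not None:
            segments.append((start, idx * frame_len))  # the run ended
            start = None
    if start is not None:                      # audio ended while still loud
        segments.append((start, len(energies) * frame_len))
    return segments

# Synthetic input: silence, a 440 Hz tone standing in for speech, then silence.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(480)]
signal = silence + tone + silence

segments = detect_speech(signal)
print(segments)
```

A production VAD would typically rely on a trained model rather than a fixed energy threshold, since background noise and music easily exceed any fixed level.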
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
