ParaShoot: A Hebrew Question Answering Dataset
Omri Keren, Omer Levy

TL;DR
ParaShoot is the first question answering dataset in modern Hebrew. It contains approximately 3000 annotated examples in the SQuAD format, and baseline results with BERT-style Hebrew models show significant room for improvement on the task.
Contribution
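The paper presents the first Hebrew QA dataset, with approximately 3000 annotated examples that follow the format and crowdsourcing methodology of SQuAD, and provides the first baseline results using recently released BERT-style models for Hebrew.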
Findings
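Baseline results with BERT-style Hebrew models leave significant room for improvement on this task, which makes it an open target for future research.
The dataset addresses the short supply of semantic datasets for Hebrew, where annotated resources have largely covered morphology and syntax.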
Abstract
NLP research in Hebrew has largely focused on morphology and syntax, where rich annotated datasets in the spirit of Universal Dependencies are available. Semantic datasets, however, are in short supply, hindering crucial advances in the development of NLP technology in Hebrew. In this work, we present ParaShoot, the first question answering dataset in modern Hebrew. The dataset follows the format and crowdsourcing methodology of SQuAD, and contains approximately 3000 annotated examples, similar to other question-answering datasets in low-resource languages. We provide the first baseline results using recently-released BERT-style models for Hebrew, showing that there is significant room for improvement on this task.
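Since the abstract states that the dataset follows the SQuAD format, a record can be pictured as a context paragraph, a question, and answer spans given by character offsets. The sketch below is a minimal illustration under that assumption: the field names follow the widely used flattened SQuAD v1.1 layout, the helper `spans_are_consistent` is hypothetical, and the Hebrew example is made up for illustration rather than taken from ParaShoot.

```python
from typing import TypedDict


class Answers(TypedDict):
    text: list[str]          # answer strings copied from the context
    answer_start: list[int]  # character offset of each answer in the context


class QAExample(TypedDict):
    id: str
    title: str
    context: str
    question: str
    answers: Answers


def spans_are_consistent(example: QAExample) -> bool:
    """Return True if every answer text appears in the context at its stated offset.

    Hypothetical sanity check, not part of any official tooling.
    """
    context = example["context"]
    answers = example["answers"]
    if len(answers["text"]) != len(answers["answer_start"]):
        return False
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Made-up record for illustration only; it is NOT drawn from the dataset.
context = "ירושלים היא בירת ישראל."
answer = "ירושלים"
example: QAExample = {
    "id": "demo-0",
    "title": "demo",
    "context": context,
    "question": "מהי בירת ישראל?",
    "answers": {
        "text": [answer],
        # Compute the offset instead of hard-coding it, to avoid miscounting.
        "answer_start": [context.index(answer)],
    },
}

print(spans_are_consistent(example))
```

Python strings are indexed by Unicode code point, so the offsets here count characters rather than bytes. A released file may use a different convention, and that is worth verifying before reusing a check like this.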
