ParaShoot: A Hebrew Question Answering Dataset

Omri Keren; Omer Levy

arXiv:2109.11314·cs.CL·September 24, 2021

TL;DR

ParaShoot is the first question answering dataset in modern Hebrew. It provides approximately 3000 annotated examples in the SQuAD format, along with baseline results from BERT-style Hebrew models that leave significant room for improvement.

Contribution

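The paper presents the first question answering dataset in modern Hebrew, with approximately 3000 annotated examples collected using the format and crowdsourcing methodology of SQuAD. It also reports the first baseline results on the task using recently released BERT-style models for Hebrew.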

Findings

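1. Baseline BERT-style models for Hebrew leave significant room for improvement on the task.

2. The dataset addresses the short supply of semantic datasets in Hebrew, where existing resources focus largely on morphology and syntax.

3. The gap left by the baselines makes Hebrew question answering an open target for future research.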

Abstract

NLP research in Hebrew has largely focused on morphology and syntax, where rich annotated datasets in the spirit of Universal Dependencies are available. Semantic datasets, however, are in short supply, hindering crucial advances in the development of NLP technology in Hebrew. In this work, we present ParaShoot, the first question answering dataset in modern Hebrew. The dataset follows the format and crowdsourcing methodology of SQuAD, and contains approximately 3000 annotated examples, similar to other question-answering datasets in low-resource languages. We provide the first baseline results using recently-released BERT-style models for Hebrew, showing that there is significant room for improvement on this task.
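Since the abstract states that the dataset follows the SQuAD format, the shape of a single example can be sketched in Python. This is a minimal illustration, not the authors' code: the flattened field names (`id`, `context`, `question`, `answers` with `text` and `answer_start`) follow the common SQuAD convention and are assumed here rather than confirmed for ParaShoot, the helper names are hypothetical, and the Hebrew sentence is an invented example, not a dataset entry.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    """One extractive QA example: the answer is a span of the context."""
    context: str
    question: str
    answer_text: str
    answer_start: int  # character offset of the answer inside context


def span_is_consistent(ex: QAExample) -> bool:
    """Check that answer_start really points at answer_text in the context."""
    end = ex.answer_start + len(ex.answer_text)
    return ex.context[ex.answer_start:end] == ex.answer_text


def to_squad_record(ex: QAExample, example_id: str) -> dict:
    """Serialize to a flattened SQuAD-style record (field names assumed)."""
    return {
        "id": example_id,
        "context": ex.context,
        "question": ex.question,
        "answers": {
            "text": [ex.answer_text],
            "answer_start": [ex.answer_start],
        },
    }


# Invented example for illustration only; not taken from the dataset.
context = "ירושלים היא בירת ישראל."
answer = "ירושלים"
example = QAExample(
    context=context,
    question="מהי בירת ישראל?",
    answer_text=answer,
    answer_start=context.index(answer),
)
record = to_squad_record(example, "demo-0")
```

Storing the answer as a character offset plus text, rather than text alone, lets a loader verify each annotation with a check like `span_is_consistent` before training a span-prediction model.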

Code & Models

Repositories

omrikeren/parashoot (official)

Datasets

imvladikon/parashoot (dataset, 73 downloads)
