SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages

Gayane Ghazaryan; Erik Arakelyan; Pasquale Minervini; Isabelle Augenstein

arXiv:2406.14425·cs.CL·September 18, 2024

TL;DR

This paper introduces SynDARin, a novel method for generating high-quality QA datasets in low-resource languages using parallel content mining, synthetic question generation, and validation, demonstrated with Armenian data.

Contribution

SynDARin provides an automated approach to create and validate multilingual QA datasets, reducing costs and enabling evaluation of LLMs in low-resource languages.

Findings

01. Generated Armenian QA dataset with 1.2K samples.

02. 98% of English questions maintained quality and diversity.

03. State-of-the-art LLMs perform poorly on the dataset.

Abstract

Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose SynDARin, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain human-curated paragraphs between English and the target language. We use the English data as context to generate synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English…
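As a rough illustration of what an automatic check on translated multiple-choice items could look like, here is a minimal sketch. The `MCItem` schema, the function names, and the three checks are assumptions made for this example; the abstract does not specify the validation criteria, so this should not be read as the authors' pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MCItem:
    """One multiple-choice QA item (hypothetical schema, not the paper's)."""
    paragraph: str      # target-language paragraph used as context
    question: str       # translated question
    options: List[str]  # translated answer options
    answer: str         # translated gold answer


def is_valid(item: MCItem) -> bool:
    """Assumed sanity checks for a translated item; the paper's own
    validation criteria may differ."""
    fields = [item.paragraph, item.question, item.answer, *item.options]
    if any(not f.strip() for f in fields):
        return False  # some field came back empty after translation
    normalized = [o.strip().lower() for o in item.options]
    if len(set(normalized)) != len(normalized):
        return False  # translation collapsed two options into one
    # The gold answer must still appear among the options.
    return item.answer.strip().lower() in normalized


def filter_items(items: List[MCItem]) -> List[MCItem]:
    """Keep only the items that pass every check in is_valid."""
    return [it for it in items if is_valid(it)]
```

Checks of this kind only catch structural damage, such as an answer that no longer matches any option. They say nothing about whether the translation preserves meaning, which would need a separate semantic check.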

Code & Models

Datasets

gayaneghazaryan/SynDARin
dataset · 7 downloads

Taxonomy

Topics: Semantic Web and Ontologies