SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
Gayane Ghazaryan, Erik Arakelyan, Pasquale Minervini, Isabelle Augenstein

TL;DR
This paper introduces SynDARin, a novel method for generating high-quality QA datasets in low-resource languages using parallel content mining, synthetic question generation, and validation, demonstrated with Armenian data.
Contribution
SynDARin provides an automated approach to create and validate multilingual QA datasets, reducing costs and enabling evaluation of LLMs in low-resource languages.
Findings
Generated Armenian QA dataset with 1.2K samples.
98% of the generated English questions maintained quality and diversity.
State-of-the-art LLMs perform poorly on the dataset.
Abstract
Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose SynDARin, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain paragraphs between English and the target language. We use the English data as context to generate synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English…
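The abstract as shown here does not spell out how translated question-answer pairs are validated. As a rough illustration only, the sketch below assumes one plausible check: a translated MC item is kept when its options are distinct and its correct answer can still be located (approximately) in the target-language paragraph. The names `MCItem`, `best_window_similarity`, `is_valid`, and the 0.8 threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class MCItem:
    """Hypothetical container for one translated multiple-choice item."""
    paragraph: str        # target-language paragraph used as context
    question: str         # translated question
    options: list[str]    # translated answer options
    answer_index: int     # index of the correct option


def best_window_similarity(needle: str, haystack: str) -> float:
    """Highest similarity ratio between needle and any window of haystack
    that has the same length as needle (case-insensitive)."""
    needle, haystack = needle.lower().strip(), haystack.lower()
    n = len(needle)
    if n == 0:
        return 0.0
    last_start = max(1, len(haystack) - n + 1)
    return max(
        SequenceMatcher(None, needle, haystack[i:i + n]).ratio()
        for i in range(last_start)
    )


def is_valid(item: MCItem, threshold: float = 0.8) -> bool:
    """Assumed validation rule, not the paper's: keep the item only if the
    answer index is in range, the options are distinct, and the correct
    answer approximately appears in the paragraph."""
    if not 0 <= item.answer_index < len(item.options):
        return False
    normalized = {opt.strip().lower() for opt in item.options}
    if len(normalized) != len(item.options):
        return False
    answer = item.options[item.answer_index]
    return best_window_similarity(answer, item.paragraph) >= threshold


def filter_items(items: list[MCItem], threshold: float = 0.8) -> list[MCItem]:
    """Drop items that fail the assumed validation rule."""
    return [item for item in items if is_valid(item, threshold)]
```

A fuzzy window match is used rather than an exact substring test because machine translation can inflect or slightly alter the answer string; the threshold would need tuning per language.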
Taxonomy
Topics: Semantic Web and Ontologies
