Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction
Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz

TL;DR
This paper introduces PUGG, the first Polish KBQA dataset, developed using a modern semi-automated pipeline that leverages LLMs to efficiently create datasets for low-resource languages, covering KBQA, MRC, and IR tasks.
Contribution
The paper presents a semi-automated, LLM-assisted dataset construction pipeline designed for low-resource languages, and uses it to build PUGG: the first Polish KBQA dataset, together with accompanying MRC and IR datasets.
Findings
Successfully created the PUGG dataset for Polish KBQA.
Provided baseline evaluations and detailed statistics for the datasets.
Demonstrated the efficiency of the semi-automated pipeline in low-resource settings.
Abstract
Advancements in AI and natural language processing have revolutionized machine-human language interactions, with question answering (QA) systems playing a pivotal role. The knowledge base question answering (KBQA) task, utilizing structured knowledge graphs (KG), allows for handling extensive knowledge-intensive questions. However, a significant gap exists in KBQA datasets, especially for low-resource languages. Many existing construction pipelines for these datasets are outdated and inefficient in human labor, and modern assisting tools like Large Language Models (LLM) are not utilized to reduce the workload. To address this, we have designed and implemented a modern, semi-automated approach for creating datasets, encompassing tasks such as KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR), tailored explicitly for low-resource environments. We executed this…
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Balanced Selection
