FoQA: A Faroese Question-Answering Dataset

Annika Simonsen; Dan Saattrup Nielsen; Hafsteinn Einarsson

arXiv:2502.07642·cs.CL·February 12, 2025

TL;DR

FoQA is a newly created Faroese extractive question-answering dataset with 2,000 validated samples, generated through a semi-automated process involving GPT-4-turbo and human validation, enabling evaluation of QA models in Faroese.

Contribution

This paper introduces FoQA, the first Faroese QA dataset, built by combining LLM-based generation with human validation to ensure quality, and reports baseline performance metrics for several models.

Findings

01 FoQA contains 2,000 validated samples for evaluating Faroese QA.

02 Baseline results are reported for multiple models, including LLMs and BERT.

03 Two further versions are released alongside the validated set: all 10,001 generated samples, and 2,395 rejected samples for error analysis.

Abstract

We present FoQA, a Faroese extractive question-answering (QA) dataset with 2,000 samples, created using a semi-automated approach combining Large Language Models (LLMs) and human validation. The dataset was generated from Faroese Wikipedia articles using GPT-4-turbo for initial QA generation, followed by question rephrasing to increase complexity and native speaker validation to ensure quality. We provide baseline performance metrics for FoQA across multiple models, including LLMs and BERT, demonstrating its effectiveness in evaluating Faroese QA performance. The dataset is released in three versions: a validated set of 2,000 samples, a complete set of all 10,001 generated samples, and a set of 2,395 rejected samples for error analysis.
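The abstract does not say which metrics the baselines are scored with. As a minimal sketch, assuming the SQuAD-style exact match and token-level F1 commonly used for extractive QA, scoring a predicted answer span against a gold span could look like the following. The helper names and the sample sentence are illustrative assumptions and are not taken from the dataset or the paper.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    # \w is Unicode-aware in Python 3, so Faroese letters (ð, ø, á, ...) are kept.
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def is_extractive(context: str, answer: str) -> bool:
    """An extractive answer must appear verbatim in the context passage."""
    return answer in context


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample for illustration only (not from FoQA).
context = "Tórshavn er høvuðsstaðurin í Føroyum."
gold = "Tórshavn"
prediction = "í Tórshavn"

print(is_extractive(context, gold))
print(exact_match(prediction, gold))
print(round(token_f1(prediction, gold), 3))
```

Exact match rewards only a fully correct span, while token F1 gives partial credit when the predicted span overlaps the gold answer, which is why the two are usually reported together.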

Code & Models

Datasets

alexandrainst/foqa

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Attention Is All You Need · Adam · Softmax · Linear Warmup With Linear Decay · Dropout · Weight Decay · WordPiece · Attention Dropout · Layer Normalization