CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

Jindřich Libovický; Jindřich Helcl; Andrei Manea; Gianluca Vico

arXiv:2507.22752 · cs.CL · February 3, 2026

TL;DR

The paper introduces CUS-QA, a multilingual, multimodal dataset for regional open-ended question answering, and evaluates large language models and evaluation metrics on this benchmark.

Contribution

The paper presents CUS-QA, a new dataset for regional open-ended QA with both textual and visual questions, and analyzes LLM performance and the reliability of automatic evaluation metrics.

Findings

01

Even the best open-weight LLMs achieve only over 40% accuracy on textual questions

02

Even the best open-weight LLMs stay below 30% accuracy on visual questions

03

Evaluation metrics correlate well with human judgments

Abstract

We introduce CUS-QA, a benchmark for evaluation of open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and those requiring visual understanding. We evaluate state-of-the-art LLMs through prompting and add human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results show that even the best open-weight LLMs achieve only over 40% accuracy on textual questions and below 30% on visual questions. LLM-based evaluation metrics show strong…

Code & Models

Datasets

ufal/cus-qa
dataset · 76 downloads
