CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset
Jindřich Libovický, Jindřich Helcl, Andrei Manea, Gianluca Vico

TL;DR
The paper introduces CUS-QA, a multilingual, multimodal dataset for regional open-ended question answering, and evaluates large language models and evaluation metrics on this benchmark.
Contribution
It presents a new dataset for regional open-ended QA with textual and visual questions, and analyzes LLM performance and evaluation metrics.
Findings
Even the best open-weight LLMs achieve only over 40% accuracy on textual questions
Even the best open-weight LLMs achieve below 30% accuracy on visual questions
Evaluation metrics correlate well with human judgments
Abstract
We introduce CUS-QA, a benchmark for evaluation of open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and those requiring visual understanding. We evaluate state-of-the-art LLMs through prompting and add human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results show that even the best open-weight LLMs achieve only over 40% accuracy on textual questions and below 30% on visual questions. LLM-based evaluation metrics show strong…
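
The abstract describes checking automatic metrics against human judgments of answer correctness. As a minimal sketch of that kind of analysis, the snippet below computes accuracy from binary human judgments and the Pearson correlation between those judgments and an automatic metric's scores. The function names and the toy numbers are illustrative assumptions, not the paper's code, data, or results, and the paper may use a different correlation statistic.

```python
from math import sqrt


def accuracy(judgments):
    """Fraction of answers judged correct (judgments are 0 or 1)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: one entry per model answer.
human_correct = [1, 0, 1, 1, 0, 0]               # human judgment: 1 = correct
metric_scores = [0.9, 0.2, 0.7, 0.8, 0.4, 0.1]   # automatic metric score

print("accuracy:", accuracy(human_correct))
print("metric vs. human correlation:", pearson(metric_scores, human_correct))
```

With binary human labels, this Pearson coefficient is the point-biserial correlation; a value near 1 would mean the metric ranks answers much as the annotators do.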
