EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams
JaeSeong Kim, Chaehwan Lim, Sang Hyun Gil, Suan Lee

TL;DR
EuraGovExam is a challenging multilingual, multimodal benchmark from real civil service exams, testing models' ability to perform layout-aware, cross-lingual reasoning on complex visual documents.
Contribution
EuraGovExam introduces a realistic, high-complexity dataset with embedded visual and multilingual content, pushing the limits of current vision-language models (VLMs) in practical, high-stakes scenarios.
Findings
Even state-of-the-art VLMs achieve only 86% accuracy on the benchmark.
The dataset's visual and linguistic complexity reveals limitations of current models.
EuraGovExam sets a new standard for evaluating models in real-world, multilingual, image-grounded tasks.
Abstract
We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representative Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. Designed to reflect the authentic complexity of public-sector assessments, the dataset contains over 8,000 high-resolution scanned multiple-choice questions covering 17 diverse academic and administrative domains. Unlike existing benchmarks, EuraGovExam embeds all question content--including problem statements, answer choices, and visual elements--within a single image, providing only a minimal standardized instruction for answer formatting. This design demands that models perform layout-aware, cross-lingual reasoning directly from visual input. All items are drawn from real exam documents, preserving rich visual structures such as tables, multilingual typography, and…
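Because each item is a single image plus a short answer-formatting instruction, scoring comes down to pulling one choice letter out of the model's free-text reply and comparing it with the answer key. The sketch below illustrates that step under stated assumptions: the `Answer: <letter>` reply format, the A-E choice range, and the helper names `extract_choice` and `accuracy_by_region` are all hypothetical and are not taken from the paper, whose actual instruction text and scoring script are not shown here.

```python
import re
from collections import defaultdict

# Assumed reply format (hypothetical): the model finishes with "Answer: <letter>".
# The lookahead rejects ordinary words such as "Answer: Based on ...".
ANSWER_RE = re.compile(r"answer\s*:\s*\(?([A-E])(?![A-Za-z])", re.IGNORECASE)


def extract_choice(response):
    """Return the last choice letter stated in the reply, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def accuracy_by_region(records):
    """Compute per-region accuracy from (region, gold_letter, reply) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for region, gold, response in records:
        total[region] += 1
        # A reply with no parseable letter counts as wrong.
        if extract_choice(response) == gold:
            correct[region] += 1
    return {region: correct[region] / total[region] for region in total}


# Toy records for illustration only; these are not benchmark data.
toy_records = [
    ("South Korea", "B", "The table shows a decline. Answer: B"),
    ("South Korea", "A", "Answer: (c)"),
    ("Japan", "D", "I cannot read the image."),
]
toy_scores = accuracy_by_region(toy_records)
```

Counting unparseable replies as incorrect is one possible design choice; a stricter or more lenient parser would shift the reported accuracy, so the parsing rule should be stated alongside any score.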
