SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents

Jaehoon Lee; Sohyun Kim; Wanggeun Park; Geon Lee; Seungkyung Kim; Minyoung Lee

arXiv:2511.04910 · cs.CL · November 11, 2025

TL;DR

This paper introduces SDS KoPub VDR, a large-scale benchmark dataset for visual document retrieval in Korean public documents, addressing language and structural complexity gaps in existing VDR benchmarks.

Contribution

It provides the first comprehensive Korean public document dataset with multimodal queries and human-verified annotations for evaluating VDR models.

Findings

01 Significant performance gaps in multimodal retrieval tasks.

02 State-of-the-art models struggle with cross-modal reasoning.

03 The dataset enables detailed evaluation of document understanding models.

Abstract

Existing benchmarks for visual document retrieval (VDR) largely overlook non-English languages and the structural complexity of official publications. To address this gap, we introduce SDS KoPub VDR, the first large-scale, public benchmark for retrieving and understanding Korean public documents. The benchmark is built upon 361 real-world documents, including 256 files under the KOGL Type 1 license and 105 from official legal portals, capturing complex visual elements like tables, charts, and multi-column layouts. To establish a reliable evaluation set, we constructed 600 query-page-answer triples. These were initially generated using multimodal models (e.g., GPT-4o) and subsequently underwent human verification to ensure factual accuracy and contextual relevance. The queries span six major public domains and are categorized by the reasoning modality required: text-based, visual-based,…
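
As a rough illustration of how the query-page-answer triples described in the abstract could be used for evaluation, the sketch below computes a top-k hit rate (Recall@k) overall and per reasoning modality. The `Triple` class, its field names, the page identifiers, the toy rankings, and the choice of Recall@k as the metric are all assumptions made for illustration; none of them is taken from the paper or from the released dataset's schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """Hypothetical container for one query-page-answer triple."""
    query: str
    page_id: str   # identifier of the ground-truth page (assumed format)
    answer: str
    modality: str  # reasoning modality label, e.g. "text" or "visual"


def recall_at_k(triples, rankings, k):
    """Return the fraction of queries whose gold page is in the top-k results.

    `rankings` maps a query string to a list of page ids ordered from most
    to least relevant. Scores are reported under "all" and per modality.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t in triples:
        ranked = rankings.get(t.query, [])
        hit = t.page_id in ranked[:k]
        for key in ("all", t.modality):
            totals[key] += 1
            hits[key] += int(hit)
    return {key: hits[key] / totals[key] for key in totals}


# Toy data, invented for this example only.
triples = [
    Triple("q1", "doc1_p3", "a1", "text"),
    Triple("q2", "doc2_p1", "a2", "visual"),
    Triple("q3", "doc3_p7", "a3", "text"),
]
rankings = {
    "q1": ["doc1_p3", "doc9_p1"],
    "q2": ["doc5_p2", "doc2_p1"],
    "q3": ["doc4_p4", "doc8_p8"],
}

scores = recall_at_k(triples, rankings, k=1)
print(scores)
```

Reporting the score per modality alongside the overall figure is one way to surface the kind of gap between text-based and visual-based queries that the summary above mentions; a real evaluation would replace the toy rankings with the output of an actual retriever.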

Code & Models

Datasets

SamsungSDS-Research/SDS-KoPub-VDR-Benchmark
dataset · 179 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques