Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation

Jong Hak Moon; Geon Choi; Paloma Rabaey; Min Gwan Kim; Jung-Oh Lee; Hyuk Gi Hong; Eun Woo Doe; Hangyul Yoon; Jiyoun Kim; Harshita Sharma; Daniel C. Castro; Javier Alvarez-Valle; and Edward Choi

arXiv:2505.21190·cs.CL·April 30, 2026

TL;DR

LUNGUAGE is a benchmark dataset and evaluation framework for structured, longitudinal chest X-ray report generation. It pairs expert-reviewed annotations that capture disease progression over time with a new scoring metric, LUNGUAGESCORE.

Contribution

Introduces the first benchmark, structuring framework, and evaluation metric for sequential radiology report generation and assessment.

Findings

01 LUNGUAGESCORE effectively evaluates structured report quality.

02 The benchmark supports both single-report and longitudinal assessments.

03 Empirical results show the scoring metric aligns well with expert judgments.

Abstract

Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 186 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage structuring framework that transforms generated reports into fine-grained, schema-aligned structured reports, enabling longitudinal interpretation. We…

Code & Models

Repositories

SuperSupermoon/Lunguage
github

Datasets

SuperSupermoon/Lunguage
dataset· 22 dl
