A Large-Scale Benchmark for Evaluating Large Language Models on Medical Question Answering in Romanian

Ana-Cristina Rogoz; Radu Tudor Ionescu; Alexandra-Valentina Anghel; Ionut-Lucian Antone-Iordache; Simona Coniac; Andreea Iuliana Ionescu

arXiv:2508.16390 · cs.CL · February 13, 2026

TL;DR

This paper introduces MedQARo, a large-scale Romanian medical question-answering benchmark with 105,880 QA pairs, and uses it to evaluate how well various large language models perform and generalize in clinical contexts.

Contribution

It presents the first comprehensive Romanian medical QA benchmark and evaluates multiple LLMs, highlighting the importance of domain-specific fine-tuning for clinical applications.

Findings

01. Fine-tuned models significantly outperform zero-shot models.

02. Pretrained models struggle to generalize on MedQARo.

03. Domain- and language-specific fine-tuning is crucial for reliable clinical QA.

Abstract

We introduce MedQARo, the first large-scale medical QA benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality and large-scale dataset comprising 105,880 QA pairs about cancer patients from two medical centers. The questions regard medical case summaries of 1,242 patients, requiring both keyword extraction and reasoning. Our benchmark contains both in-domain and cross-domain (cross-center and cross-cancer) test collections, enabling a precise assessment of generalization capabilities. We experiment with four open-source LLMs from distinct families of models on MedQARo. Each model is employed in two scenarios: zero-shot prompting and supervised fine-tuning. We also evaluate two state-of-the-art LLMs exposed only through APIs, namely GPT-5.2 and Gemini 3 Flash. Our results show that fine-tuned models…
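
The distinction between in-domain and cross-domain (cross-center and cross-cancer) test collections can be illustrated with a minimal Python sketch. It is not the authors' procedure, which this excerpt does not describe. The record fields (`patient_id`, `center`, `cancer_type`), the function `split_benchmark`, and the rule that a pair from another center counts as cross-center even when its cancer type is also unseen are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    # All field names are hypothetical; the real MedQARo schema may differ.
    patient_id: str
    center: str
    cancer_type: str
    question: str
    answer: str


def split_benchmark(pairs, train_center, train_cancers, held_out_patients):
    """Partition QA pairs into train, in-domain, cross-center and cross-cancer sets.

    Assumed rule: a pair is cross-center if it comes from another center,
    cross-cancer if it comes from the training center but an unseen cancer
    type, and in-domain if its patient was held out of training.
    """
    splits = {"train": [], "in_domain": [], "cross_center": [], "cross_cancer": []}
    for pair in pairs:
        if pair.center != train_center:
            splits["cross_center"].append(pair)
        elif pair.cancer_type not in train_cancers:
            splits["cross_cancer"].append(pair)
        elif pair.patient_id in held_out_patients:
            splits["in_domain"].append(pair)
        else:
            splits["train"].append(pair)
    return splits
```

Splitting by patient rather than by individual QA pair keeps every question about one case summary in the same set, so a model is never tested on a patient it saw during fine-tuning.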

Code & Models

Datasets

mariocedo/GlobalMedQA
dataset · 13 dl