Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs

Xiaoke Huang; Ningsen Wang; Hui Liu; Xianfeng Tang; Yuyin Zhou

arXiv:2510.25867·cs.LG·February 19, 2026

TL;DR

This paper introduces MedVLSynther, a framework that synthesizes high-quality medical VQA data from biomedical literature using generator-verifier models, enabling improved training of open-weight LMMs for medical question answering.

Contribution

The authors develop a novel generator-verifier pipeline to create large, high-quality medical VQA datasets from open literature, enhancing model training and evaluation.

Findings

01. Achieved state-of-the-art accuracy on multiple medical VQA benchmarks.

02. Generated over 13,000 verified medical VQA questions from open biomedical literature.

03. Demonstrated the necessity of both generation and verification stages for high-quality data.

Abstract

Large Multimodal Models (LMMs) are increasingly capable of answering medical questions that require joint reasoning over images and text, yet training general medical VQA systems is impeded by the lack of large, openly usable, high-quality corpora. We present MedVLSynther, a rubric-guided generator-verifier framework that synthesizes high-quality multiple-choice VQA items directly from open biomedical literature by conditioning on figures, captions, and in-text references. The generator produces self-contained stems and parallel, mutually exclusive options under a machine-checkable JSON schema; a multi-stage verifier enforces essential gates (self-containment, single correct answer, clinical validity, image-text consistency), awards fine-grained positive points, and penalizes common failure modes before acceptance. Applying this pipeline to PubMed Central yields MedSynVQA: 13,087…
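
The acceptance logic described in the abstract (schema check, essential gates, positive points, penalties) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the field names, the 10-point scale, and the 0.9 threshold are assumptions, and the verifier model's judgments are represented here as plain booleans and numbers.

```python
from dataclasses import dataclass

# Gate names follow the abstract; the identifiers themselves are hypothetical.
ESSENTIAL_GATES = (
    "self_contained",
    "single_correct_answer",
    "clinically_valid",
    "image_text_consistent",
)


def schema_ok(item: dict) -> bool:
    """Check a hypothetical item schema: a stem, >= 2 options, answer among them."""
    question = item.get("question")
    options = item.get("options")
    answer = item.get("answer")
    if not isinstance(question, str) or not question.strip():
        return False
    if not isinstance(options, dict) or len(options) < 2:
        return False
    return isinstance(answer, str) and answer in options


@dataclass
class Verdict:
    gates: dict            # gate name -> bool, as judged by a verifier model
    bonus: float = 0.0     # fine-grained positive points
    penalty: float = 0.0   # points deducted for common failure modes


def accept(item: dict, verdict: Verdict,
           max_points: float = 10.0, threshold: float = 0.9) -> bool:
    """Accept an item only if it is well-formed, passes every gate, and scores high enough.

    max_points and threshold are assumed values for illustration.
    """
    if not schema_ok(item):
        return False
    if not all(verdict.gates.get(g, False) for g in ESSENTIAL_GATES):
        return False
    score = max(0.0, verdict.bonus - verdict.penalty) / max_points
    return score >= threshold


item = {
    "question": "Which finding is shown in the image?",
    "options": {"A": "Finding one", "B": "Finding two"},
    "answer": "A",
}
all_pass = {g: True for g in ESSENTIAL_GATES}
print(accept(item, Verdict(gates=all_pass, bonus=10.0)))
print(accept(item, Verdict(gates=all_pass, bonus=10.0, penalty=2.0)))
```

Treating the gates as hard requirements and the points as a soft score keeps the two roles separate: a single failed gate rejects an item regardless of how many points it would otherwise earn.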

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 5

Strengths

1. Provides a new and useful medical VQA dataset.
2. Carefully analyzes the data generation pipeline.
3. Conducts large-scale evaluation showing the dataset’s usefulness.

Weaknesses

My main concerns fall into two areas:
1. The improvement is limited. As shown in Table 2, most gains appear to come from using PMC figures, while the additional rubric-based filtering contributes little to the performance improvement. Can the authors explain these results further?
2. The paper lacks a direct evaluation of data quality. All reported results rely on downstream task performance rather than explicitly measuring the dataset’s intrinsic quality. It is strongly recommended that the

Reviewer 02·Rating 2·Confidence 4

Strengths

- The authors propose the first fully open pipeline for creating a synthetic medical VQA dataset. Everything from the dataset and models to the generation and verification process is transparent, so the pipeline can be easily customized to various settings, which benefits the research community.
- A deep investigation of the generator and verifier is presented in the paper, which sheds light on how others can leverage the proposed pipeline.

Weaknesses

In summary, my concerns relate mainly to the evaluation of the resulting data:
- Potential data leakage: Although we cannot say much about the other open models used as baselines, it seems to me that the resulting MedVLSynther has a data leakage issue when evaluated on the PMC benchmark, because Biomedica is derived from PMC.
- Inconsistent and confusing table and figure results: The numbers for MedVLThinker-3B/7B presented in Figure 1c do not match those in Table 6. Feel like the numbers in Fig

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The paper introduces a generator-verifier architecture that separates generation and verification using different LMMs, enabling auditable quality control through a rigorous three-stage process of essential gates, fine-grained scoring, and penalty detection.
2. The context-aware generation approach incorporates not just image captions but also in-text reference paragraphs from the surrounding literature.
3. The comprehensive rubric design with seven essential criteria, four to eight fine-g

Weaknesses

This paper is technically competent and presents a well-engineered pipeline. However, I have concerns about whether it makes a sufficient scientific contribution to a top-tier venue. Apart from the proposed datasets, what are the main insights people can get from this paper?
1. The paper positions itself as achieving high quality despite a small scale (13K samples), but lacks a direct empirical comparison with larger-scale alternatives. It does not have direct comparison using PMC-VQA's full 22
