MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing

Wenjie Wang; Wei Wu; Ying Liu; Yuan Zhao; Xiaole Lv; Liang Diao; Zengjian Fan; Wenfeng Xie; Ziling Lin; De Shi; Lin Huang; Kaihe Xu; Hong Li

arXiv:2602.06402 · cs.CV · February 9, 2026

TL;DR

MeDocVL is a specialized vision-language model for parsing complex medical documents. It handles noisy annotations and complex layouts through training-driven label refinement and noise-aware hybrid post-training, and it outperforms conventional OCR systems and VLM baselines on medical invoice benchmarks.

Contribution

The paper introduces MeDocVL, a post-trained vision-language model for medical document parsing that combines training-driven label refinement with noise-aware hybrid post-training.

Findings

01. Outperforms conventional OCR systems on medical invoice benchmarks.

02. Achieves state-of-the-art accuracy under noisy supervision.

03. Demonstrates robustness to complex layouts and noisy annotations.

Abstract

Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
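
The abstract mentions strict field-level exact matching. As a rough illustration only, the sketch below scores a predicted set of fields against a reference, counting a field as correct only when the strings are identical. The function names, the field names, the sample values, and the choice to apply no normalization are all assumptions made for this example; the paper's actual evaluation protocol is not described on this page.

```python
from typing import Dict


def field_exact_match(pred: Dict[str, str], gold: Dict[str, str]) -> Dict[str, bool]:
    """For each reference field, report whether the prediction is string-identical.

    Assumption: no normalization (whitespace, case, number formatting) is applied,
    and a field missing from the prediction counts as a mismatch.
    """
    return {name: pred.get(name) == value for name, value in gold.items()}


def field_accuracy(pred: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of reference fields matched exactly (0.0 if there are no fields)."""
    matches = field_exact_match(pred, gold)
    return sum(matches.values()) / len(matches) if matches else 0.0


# Hypothetical invoice fields, invented for illustration.
gold = {"invoice_no": "A-10293", "total": "128.50", "patient": "Zhang Wei"}
pred = {"invoice_no": "A-10293", "total": "128.5", "patient": "Zhang Wei"}

per_field = field_exact_match(pred, gold)
accuracy = field_accuracy(pred, gold)
print(per_field)
print(accuracy)
```

In this sketch, "128.5" and "128.50" count as a mismatch even though they denote the same amount, which shows why exact-match scoring leaves little room for near-miss extractions.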

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Topic Modeling