See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis

Ruinan Jin; Gexin Huang; Xinwei Shen; Qiong Zhang; Yan Shuo Tan; Xiaoxiao Li

arXiv:2506.18140·cs.CV·February 24, 2026

See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis

Ruinan Jin, Gexin Huang, Xinwei Shen, Qiong Zhang, Yan Shuo Tan, Xiaoxiao Li

PDF

TL;DR

This paper demonstrates that incorporating healthy reference images and comparative prompts into vision-language models significantly improves medical diagnosis accuracy, with benefits from lightweight fine-tuning and various reference selection strategies.

Contribution

It introduces a reference image-guided approach for medical VLMs, enhancing diagnostic performance through comparative analysis and practical reference selection methods.

Findings

01

Improved diagnostic accuracy with reference images and prompts

02

Lightweight fine-tuning further boosts performance

03

Consistent results across different reference selection strategies

Abstract

Medical image diagnosis is challenging because many diseases resemble normal anatomy and exhibit substantial interpatient variability. Clinicians routinely rely on comparative diagnosis, such as referencing cross-patient healthy control images to identify subtle but clinically meaningful abnormalities. Although healthy reference images are abundant in practice, existing medical vision-language models (VLMs) primarily operate in a single-image or single-series setting and lack explicit mechanisms for comparative diagnosis. This work investigates whether incorporating clinically motivated comparison can enhance VLM performance. We show that providing VLMs with both a query image and a matched healthy reference image, accompanied by cross-patient comparative prompts, significantly improves diagnostic performance. This performance can be further augmented by lightweight supervised…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.