TL;DR
LiteMedCoT-VL is a parameter-efficient pipeline that transfers multi-step reasoning from a large teacher (235B) to a compact student (2B) for medical visual question answering, with inference run without image captions by default.
Contribution
The paper introduces a LoRA-based fine-tuning pipeline that distills chain-of-thought reasoning from a large teacher model into a much smaller student for medical VQA tasks.
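For readers unfamiliar with LoRA, the sketch below illustrates the general technique, not the authors' code: a frozen weight matrix W is augmented with a trainable low-rank update, W + (alpha / r) * B A. The dimensions, rank, scaling value, and all function names here are illustrative assumptions; the summary does not state the paper's actual LoRA settings.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where W stays frozen and only A, B train."""
    r = len(a)                # rank: A has shape (r, d_in)
    scale = alpha / r
    delta = matmul(b, a)      # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_trainable_params(d_out, d_in, r):
    """Parameters in the adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


# Hypothetical toy sizes, chosen only to keep the example small.
d_out, d_in, r, alpha = 4, 6, 2, 4.0
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]      # trainable
b = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init

w_eff = lora_effective_weight(w, a, b, alpha)
full_params = d_out * d_in
adapter_params = lora_trainable_params(d_out, d_in, r)
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen base layer, and the adapter holds r * (d_in + d_out) trainable parameters instead of d_out * d_in. The saving grows with layer size: for a 2048 x 2048 projection at rank 8, that is 32,768 parameters against 4,194,304.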
Findings
Achieves 64.9% accuracy on the PMC-VQA benchmark, surpassing larger models.
Outperforms all published baselines in medical VQA.
Relies on image content rather than textual priors during inference.
Abstract
The reasoning gap between large and compact vision-language models (VLMs) limits the deployment of medical AI on portable clinical devices. Compact VLMs of 2–4B parameters can run on resource-constrained hardware but lack the multi-step reasoning capacity needed for interpretable clinical decision support. Existing knowledge distillation methods transfer answers without the reasoning process behind them. Medical visual question answering (VQA) serves as a testbed for this problem, as it requires models to integrate visual evidence with clinical knowledge through structured reasoning chains. We introduce LiteMedCoT-VL, a pipeline that transfers chain-of-thought reasoning from a 235B teacher model to 2B student models through LoRA-based fine-tuning on explanation-enriched training data. All inference is conducted without image captions by default, simulating the clinical scenario in…
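As a rough illustration of what "explanation-enriched training data" could look like, the sketch below pairs a multiple-choice question with a teacher-written rationale and the final answer, and deliberately carries no caption field. The record layout, field names, file path, and sample question are all hypothetical; the abstract does not specify the authors' data format.

```python
def build_training_record(image_path, question, options, rationale, answer):
    """Assemble one supervised example: the prompt goes in, the rationale plus answer come out.

    Hypothetical schema for illustration only. No image caption is stored,
    mirroring the caption-free inference setting described in the abstract.
    """
    prompt_lines = [f"Question: {question}"]
    prompt_lines += [f"{label}. {text}" for label, text in sorted(options.items())]
    target = f"Reasoning: {rationale}\nAnswer: {answer}"
    return {
        "image": image_path,
        "prompt": "\n".join(prompt_lines),
        "target": target,
    }


# Made-up example data, not taken from PMC-VQA.
record = build_training_record(
    image_path="images/example_0001.png",
    question="Which imaging modality is shown?",
    options={"A": "MRI", "B": "CT", "C": "Ultrasound", "D": "X-ray"},
    rationale="Cross-sectional slice with bone appearing bright suggests CT.",
    answer="B",
)
```

Training the student on the full target (rationale followed by answer) rather than the answer letter alone is what distinguishes this setup from answer-only distillation.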
