TL;DR
This paper introduces a Wasserstein-based semantic stopping criterion for game-theoretic decoding in medical visual question answering, improving accuracy and efficiency of small vision-language models.
Contribution
It extends game-theoretic decoding to vision-language models with a novel Wasserstein criterion that enhances semantic consensus and reduces unnecessary iterations.
Findings
Improved accuracy by 3.5% on VQA-RAD with Qwen3-VL-2B.
Matched MedGemma-4B performance without domain-specific fine-tuning.
Reduced average convergence iterations by 20%, improving inference efficiency.
Abstract
Small vision-language models (2-8B) are well-suited for clinical deployment due to privacy constraints, limited connectivity, and low-latency requirements favouring on-device or on-premise inference. However, their limited capacity exacerbates the generation of plausible but incorrect outputs. We extend game-theoretic decoding, previously restricted to text-only, closed-ended NLP tasks, to vision-language models for open-ended Medical VQA. We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, enabling convergence based on semantic consensus among near-synonymous candidate answers and avoiding unnecessary iterations caused by clinically equivalent ranking swaps. On VQA-RAD and PathVQA, we obtain consistent, statistically significant improvements over greedy and discriminative baselines. On VQA-RAD, we improve Qwen3-VL-2B by +3.5…
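The abstract does not spell out the stopping rule, so the following is only a minimal sketch of how a semantically aware Wasserstein criterion could behave, not the authors' implementation. It assumes that each decoding round yields a probability distribution over the same candidate answers, and that a pairwise semantic cost matrix (for instance, a distance between answer embeddings) is available. The names `semantic_wasserstein`, `has_converged`, and `tau`, the threshold value, and the cost values are all hypothetical. The transport cost is computed exactly with a small successive-shortest-path solver in pure Python.

```python
import math

def semantic_wasserstein(p, q, cost, eps=1e-12):
    """Optimal transport cost between distributions p and q over the same
    candidates, where cost[i][j] is an assumed semantic distance."""
    n = len(p)
    source, sink, num_nodes = 2 * n, 2 * n + 1, 2 * n + 2
    edges = []                       # [head, residual capacity, unit cost]
    adj = [[] for _ in range(num_nodes)]

    def add_edge(u, v, cap, c):
        # Edge k (forward) and edge k ^ 1 (backward) form a residual pair.
        adj[u].append(len(edges)); edges.append([v, cap, c])
        adj[v].append(len(edges)); edges.append([u, 0.0, -c])

    for i in range(n):
        add_edge(source, i, p[i], 0.0)       # mass available at candidate i
        add_edge(n + i, sink, q[i], 0.0)     # mass required at candidate i
        for j in range(n):
            add_edge(i, n + j, math.inf, cost[i][j])

    total = 0.0
    while True:
        # Bellman-Ford: cheapest residual path from source to sink.
        dist = [math.inf] * num_nodes
        via = [-1] * num_nodes
        dist[source] = 0.0
        for _ in range(num_nodes - 1):
            changed = False
            for u in range(num_nodes):
                if dist[u] == math.inf:
                    continue
                for k in adj[u]:
                    v, cap, c = edges[k]
                    if cap > eps and dist[u] + c < dist[v] - eps:
                        dist[v], via[v], changed = dist[u] + c, k, True
            if not changed:
                break
        if dist[sink] == math.inf:
            return total                     # no mass left to move
        # Recover the path, then push as much mass as it allows.
        path, v = [], sink
        while v != source:
            path.append(via[v])
            v = edges[via[v] ^ 1][0]
        push = min(edges[k][1] for k in path)
        for k in path:
            edges[k][1] -= push
            edges[k ^ 1][1] += push
        total += push * dist[sink]

def has_converged(prev, curr, cost, tau=0.01):
    """Stop when little probability mass moved between distant meanings."""
    return semantic_wasserstein(prev, curr, cost) <= tau

# Illustrative values only: candidates 0 and 1 are near-synonyms,
# candidate 2 is a semantically distant answer.
cost = [[0.0, 0.05, 1.0],
        [0.05, 0.0, 1.0],
        [1.0, 1.0, 0.0]]
prev_round = [0.5, 0.4, 0.1]
synonym_swap = [0.4, 0.5, 0.1]    # top answer changes, meaning barely does
real_shift = [0.2, 0.3, 0.5]      # mass moves to a different meaning

print(has_converged(prev_round, synonym_swap, cost))
print(has_converged(prev_round, real_shift, cost))
```

In this toy setup, a lexical order-matching rule would treat the synonym swap as a change in the ranking and keep iterating, whereas the transport cost of that swap is small because the two answers are close in the assumed cost matrix. The shift toward the distant candidate produces a much larger cost and would not trigger the stop.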
