HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models

Songtao Jiang; Yan Zhang; Yeying Jin; Zhihang Tang; Yangyang Wu; Yang Feng; Jian Wu; Zuozhu Liu

arXiv:2506.00805·cs.CV·June 3, 2025

TL;DR

This paper introduces HSCR, a novel hierarchical self-contrastive rewarding method that improves alignment and trustworthiness of medical vision-language models by generating high-quality preference data and capturing nuanced preferences.

Contribution

HSCR presents a cost-effective way to generate preference data and a multi-level optimization strategy for better modality alignment in Med-VLMs.

Findings

01 Enhanced zero-shot performance across medical tasks
02 Significant improvement in modality alignment and trustworthiness
03 Effective with only 2,000 training entries

Abstract

Medical Vision-Language Models (Med-VLMs) have achieved success across various tasks, yet most existing methods overlook the modality misalignment issue that can lead to untrustworthy responses in clinical settings. In this paper, we propose Hierarchical Self-Contrastive Rewarding (HSCR), a novel approach that addresses two critical challenges in Med-VLM alignment: 1) Cost-effective generation of high-quality preference data; 2) Capturing nuanced and context-aware preferences for improved alignment. HSCR first leverages the inherent capability of Med-VLMs to generate dispreferred responses with higher sampling probability. By analyzing output logit shifts after visual token dropout, we identify modality-coupled tokens that induce misalignment and derive an implicit alignment reward function. This function guides token replacement with hallucinated ones during decoding, producing…
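The logit-shift idea in the abstract can be illustrated with a small sketch. The abstract does not give the reward function, so everything here is an assumption made for illustration: the helper names (`logit_shift_scores`, `modality_coupled_positions`), the use of a log-probability difference as the shift measure, and the threshold of 1.0 are hypothetical and are not taken from the paper. The sketch scores each generated token by how much its log-probability falls when the visual tokens are dropped, then flags positions with a large fall as candidates for modality-coupled tokens.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def logit_shift_scores(full_logits, dropped_logits, token_ids):
    """Per-position drop in log-probability of the emitted token.

    full_logits:    logits per position with all visual tokens present
    dropped_logits: logits per position after visual token dropout
    token_ids:      the token actually emitted at each position
    (Hypothetical shift measure; the paper's reward may differ.)
    """
    scores = []
    for full, dropped, tok in zip(full_logits, dropped_logits, token_ids):
        lp_full = log_softmax(full)[tok]
        lp_drop = log_softmax(dropped)[tok]
        scores.append(lp_full - lp_drop)
    return scores


def modality_coupled_positions(scores, threshold=1.0):
    """Positions whose shift exceeds an assumed threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy input: vocabulary of 3 tokens, 2 generated positions.
full = [[5.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
dropped = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
tokens = [0, 1]

scores = logit_shift_scores(full, dropped, tokens)
print(modality_coupled_positions(scores))
```

In this toy input only the first token depends on the image: its logits flatten once the visual tokens are removed, while the second token's logits are unchanged. A real pipeline would obtain both sets of logits from two forward passes of the Med-VLM, which this sketch does not attempt.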

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques