MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning

Siyong Chen; Jinbo Wen; Jiawen Kang; Tenghui Huang; Xumin Huang; Yuanjia Su; Hudan Pan; Zishao Zhong; Dusit Niyato; Shengli Xie; and Dong In Kim

arXiv:2510.21093·cs.AI·October 27, 2025

TL;DR

MedAlign is a framework for Medical Visual Question Answering (Med-VQA) that aligns preference learning with visual evidence, routes queries to specialized experts, and uses federated, adaptive-depth reasoning to improve accuracy and efficiency.

Contribution

This paper introduces MedAlign, which integrates multimodal preference optimization, retrieval-aware expert routing, and federated meta-cognitive reasoning to improve the performance of Large Vision-Language Models (LVLMs) in clinical settings.

Findings

01. Achieves up to 11.85% F1-score improvement over baselines

02. Reduces reasoning length by 51.60% compared to fixed-depth methods

03. Demonstrates effectiveness across three Med-VQA datasets

Abstract

Recently, large models have shown significant potential for smart healthcare. However, the deployment of Large Vision-Language Models (LVLMs) for clinical services is currently hindered by three critical challenges: a tendency to hallucinate answers not grounded in visual evidence, the inefficiency of fixed-depth reasoning, and the difficulty of multi-institutional collaboration. To address these challenges, in this paper, we develop MedAlign, a novel framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). Specifically, we first propose a multimodal Direct Preference Optimization (mDPO) objective to explicitly align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM (i.e., an…
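The routing idea named in the abstract (using image and text similarity to pick a specialized expert) can be illustrated with a minimal sketch. This is not the paper's RA-MoE implementation: the weighted-sum scoring rule, the `alpha` parameter, the per-expert prototype embeddings, and the expert names are all assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(query_img, query_txt, experts, alpha=0.5):
    """Pick the expert whose prototypes best match the query.

    experts: dict mapping a (hypothetical) expert name to a pair
             (image_prototype, text_prototype) of embedding vectors.
    alpha:   assumed weight on image similarity; (1 - alpha) goes to text.
    Returns (best_expert_name, scores_by_expert).
    """
    scores = {
        name: alpha * cosine(query_img, img_proto)
        + (1.0 - alpha) * cosine(query_txt, txt_proto)
        for name, (img_proto, txt_proto) in experts.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy 2-D embeddings; real embeddings would come from image and text encoders.
EXPERTS = {
    "radiology": ([1.0, 0.0], [1.0, 0.0]),
    "pathology": ([0.0, 1.0], [0.0, 1.0]),
}

if __name__ == "__main__":
    chosen, scores = route([0.9, 0.1], [0.8, 0.2], EXPERTS)
    print(chosen, scores)
```

In this sketch, a larger `alpha` makes routing depend more on visual similarity, and a smaller one makes it depend more on the question text. How the actual framework combines the two signals is not stated in the abstract excerpt above.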

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning