Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs

Huan Zheng; Yucheng Zhou; Tianyi Yan; Dubing Chen; Hongbo Lu; Wenlong Liao; Tao He; Pai Peng; Jianbing Shen

arXiv:2603.20698 · cs.CV · April 10, 2026

TL;DR

This paper introduces a clinical-cognition-aligned framework for multimodal large language models in gastrointestinal diagnosis, improving causal reasoning and diagnostic accuracy by integrating hierarchical clinical logic and counterfactual reinforcement learning.

Contribution

The paper proposes CogAlign, a framework that combines hierarchical clinical cognition modeling with causal rectification via counterfactual reinforcement learning, with the aim of grounding gastrointestinal diagnosis in expert reasoning and causal lesion features.

Findings

01. Achieves state-of-the-art performance on multiple benchmarks.

02. Significantly improves diagnostic accuracy in complex scenarios.

03. Effectively grounds diagnosis in causal lesion features.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the…
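The hierarchical ordering named in the abstract (anatomical localization, then morphological evaluation, then microvascular analysis) can be pictured as a structured supervised fine-tuning target. The following is a minimal sketch only: the class name, field names, and serialization format are hypothetical, since this excerpt does not give the schema of the paper's dataset.

```python
from dataclasses import dataclass

# Stage order follows the abstract; the identifiers themselves are hypothetical.
STAGES = (
    "anatomical_localization",
    "morphological_evaluation",
    "microvascular_analysis",
)


@dataclass
class CognitionRecord:
    """One illustrative training example with stage-wise findings."""

    anatomical_localization: str
    morphological_evaluation: str
    microvascular_analysis: str
    diagnosis: str

    def to_sft_target(self) -> str:
        """Serialize the stages in clinical order, ending with the diagnosis."""
        lines = [
            f"{i}. {name.replace('_', ' ')}: {getattr(self, name)}"
            for i, name in enumerate(STAGES, start=1)
        ]
        lines.append(f"Diagnosis: {self.diagnosis}")
        return "\n".join(lines)


# Placeholder strings stand in for real annotations.
record = CognitionRecord(
    anatomical_localization="<site of the lesion>",
    morphological_evaluation="<shape and surface description>",
    microvascular_analysis="<vessel pattern description>",
    diagnosis="<final label>",
)
print(record.to_sft_target())
```

Fixing the stage order in one tuple means every serialized target presents the findings in the same sequence, which is the property the abstract attributes to expert diagnostic logic.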
