MedCEG: Reinforcing Verifiable Medical Reasoning with Critical Evidence Graph

Linjie Mu; Yannian Gu; Zhongzhen Huang; Yakun Zhu; Shaoting Zhang; Xiaofan Zhang

arXiv:2512.13510·cs.AI·December 16, 2025

Open Access · 4 Reviews

TL;DR

MedCEG enhances medical language models by explicitly supervising their reasoning with verifiable evidence graphs, improving clinical validity and reasoning quality in medical AI applications.

Contribution

This paper introduces MedCEG, a novel framework that incorporates a Critical Evidence Graph and a specialized reward to improve the clinical validity of reasoning in medical language models.

Findings

1. MedCEG outperforms existing methods in reasoning accuracy.
2. MedCEG produces clinically valid and verifiable reasoning chains.
3. The framework significantly enhances the reliability of medical AI reasoning.

Abstract

Large language models with reasoning capabilities have demonstrated impressive performance across a wide range of domains. In clinical applications, a transparent, step-by-step reasoning process provides physicians with strong evidence to support decision-making. While reinforcement learning has effectively enhanced reasoning performance in medical contexts, the clinical reliability of these reasoning processes remains limited because their accuracy and validity are often overlooked during training. To address this gap, we propose MedCEG, a framework that augments medical language models with clinically valid reasoning pathways by explicitly supervising the reasoning process through a Critical Evidence Graph (CEG). We curate a dataset of challenging clinical cases and algorithmically construct a CEG for each sample to represent a high-quality verifiable reasoning pathway. To guide the…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Novel use of CEG (minimal, causally connected subgraph) as a direct, non-learned reward signal.
- Useful composite process reward (node/structure/chain) with informative ablations.
- Release of a CEG dataset may benefit the community.

Weaknesses

- Heavy reliance on LLMs for EG/CEG construction, rationale parsing, answer judging, and process scoring risks bias/circularity.
- Possible train–test overlap: training corpus is curated from benchmarks also used for evaluation; deduplication details are limited.
- Reward sensitivity: semantic matching thresholds and embedding-based node coverage may over-credit near-synonyms; limited analysis.
- Chain Completeness computed on an undirected graph; directed path validity may be more appropriate f
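
The reviewer's point about undirected versus directed chain checks can be illustrated with a small sketch. This is not the paper's Chain Completeness metric, whose definition is not given here. The graph, node names, and helper functions below are hypothetical, and the check is a plain breadth-first reachability test chosen only to show how the two views can disagree.

```python
from collections import deque


def build_adjacency(edges, directed):
    """Build an adjacency map from (source, target) pairs.

    When directed is False, each edge is also added in reverse,
    which is what an undirected connectivity check effectively does.
    """
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        if not directed:
            adj.setdefault(dst, set()).add(src)
    return adj


def reachable(adj, start, goal):
    """Breadth-first search: True if goal can be reached from start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical toy reasoning graph; an edge (a, b) means "a supports b".
# The second edge points from the conclusion back to an intermediate step,
# so no forward chain leads from the finding to the conclusion.
edges = [
    ("finding_A", "intermediate_B"),
    ("conclusion_C", "intermediate_B"),
]

undirected_ok = reachable(build_adjacency(edges, directed=False),
                          "finding_A", "conclusion_C")
directed_ok = reachable(build_adjacency(edges, directed=True),
                        "finding_A", "conclusion_C")

print("undirected chain found:", undirected_ok)
print("directed chain found:", directed_ok)
```

In this toy case the undirected check reports a complete chain while the directed check does not, which is the kind of over-crediting the reviewer appears to be concerned about.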

Reviewer 02 · Rating 4 · Confidence 5

Strengths

The authors propose a graph-based reward function to guide the reasoning process. The reward design builds on basic graph concepts, which makes sense.

Weaknesses

1. The authors described their work as verifiable medical reasoning, but this concept is not clear to me. From what I understand, the method aligns the LLM’s reasoning process with the Critical Evidence Graph (CEG). However, there is no guarantee that the CEG itself is correct or based on accurate medical knowledge. In my view, verifiable medical reasoning should mean that the model reasons based on verified external knowledge sources. Since the model does not interact with any reliable knowledg

Reviewer 03 · Rating 2 · Confidence 5

Strengths

* The authors construct and release a 10K clinical case dataset with structured reasoning graphs, likely valuable for future research on explainable medical LLMs.

Weaknesses

**Major limitation: mischaracterization of SOTA performance of prior work**: The Results section omits crucial studies, which misleads the reader into believing that this work achieves SOTA performance when it does not. Whether this is intentional or not, it needs to be corrected. For instance, for each benchmark, it is unclear why the best-performing models were not reported. In MedQA, GPT-4 achieves 90% accuracy, while MedGemini reaches 91%. However, the best performance reported

Reviewer 04 · Rating 4 · Confidence 5

Strengths

1. The paper is well-structured and clearly written, making it easy to follow.
2. The focus on medical reasoning, aiming for both reliable reasoning processes and correct answers, is meaningful and relevant for advancing LLM performance on clinical tasks.
3. Combining reasoning chains and graph structures as a reward signal enhances both accuracy and reasoning quality, as evidenced by detailed results in several medical question datasets.

Weaknesses

1. The entire pipeline heavily depends on several powerful LLMs for generating reasoning processes, constructing graphs, computing rewards, and even performing model evaluation (via LLM-as-a-judge). None of these steps involves natural linkages or expert verification, which raises concerns about cumulative noise and inherited bias.
2. The evaluation design is largely self-referential. GPT-OSS-120B is used to generate reasoning traces for SFT training in the cold-start stage, and the same family

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling