Evaluating Causal Explanation in Medical Reports with LLM-Based and Human-Aligned Metrics

Yousang Cho; Key-Sun Choi

arXiv:2506.18387·cs.CL·June 24, 2025

TL;DR

This paper evaluates how well automatic metrics, both LLM-based and traditional similarity-based, measure the quality of causal explanations in medical reports, and finds that GPT-Black discriminates best among them.

Contribution

The paper compares multiple evaluation metrics for causal explanations in medical reports and shows that the LLM-based metrics GPT-Black and GPT-White are effective.

Findings

01. GPT-Black best discriminates coherent causal narratives

02. GPT-White aligns well with expert assessments

03. Similarity metrics diverge from clinical reasoning quality

Abstract

This study investigates how accurately different evaluation metrics capture the quality of causal explanations in automatically generated diagnostic reports. We compare six metrics (BERTScore, Cosine Similarity, BioSentVec, GPT-White, GPT-Black, and expert qualitative assessment) across two input types: observation-based and multiple-choice-based report generation. Two weighting strategies are applied: one reflecting task-specific priorities, and the other assigning equal weights to all metrics. Our results show that GPT-Black demonstrates the strongest discriminative power in identifying logically coherent and clinically valid causal narratives. GPT-White also aligns well with expert evaluations, while similarity-based metrics diverge from clinical reasoning quality. These findings emphasize the impact of metric selection and weighting on evaluation outcomes, supporting the use of…
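
To make the two weighting strategies concrete, the sketch below combines per-metric scores for a single report into one aggregate value, once with equal weights and once with task-specific weights. It is only an illustration: the score values, the weight values, and the names `weighted_score`, `equal_weights`, `report_scores`, and `task_weights` are all hypothetical and are not taken from the paper, which may aggregate its metrics differently.

```python
from typing import Dict, Iterable

# Metric names as listed in the abstract.
METRICS = [
    "BERTScore",
    "Cosine Similarity",
    "BioSentVec",
    "GPT-White",
    "GPT-Black",
    "Expert",
]


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted mean of per-metric scores.

    Weights are normalized by their sum, so they need not add up to 1.
    """
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total


def equal_weights(metrics: Iterable[str]) -> Dict[str, float]:
    """Assign the same weight to every metric (the equal-weight strategy)."""
    return {name: 1.0 for name in metrics}


# Hypothetical scores in [0, 1] for one generated report (not from the paper).
report_scores = {
    "BERTScore": 0.9,
    "Cosine Similarity": 0.8,
    "BioSentVec": 0.7,
    "GPT-White": 0.4,
    "GPT-Black": 0.3,
    "Expert": 0.5,
}

# Hypothetical task-specific weights that favor reasoning-oriented metrics.
task_weights = {
    "BERTScore": 1.0,
    "Cosine Similarity": 1.0,
    "BioSentVec": 1.0,
    "GPT-White": 2.0,
    "GPT-Black": 3.0,
    "Expert": 2.0,
}

equal = weighted_score(report_scores, equal_weights(METRICS))
task = weighted_score(report_scores, task_weights)
print(f"equal-weight score: {equal:.3f}")
print(f"task-weighted score: {task:.3f}")
```

With these made-up numbers, a report that scores well on surface similarity but poorly on the LLM-based and expert metrics receives a lower aggregate under the task-specific weights than under equal weights, which is the kind of sensitivity to weighting the abstract points to.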
