Evaluating Causal Explanation in Medical Reports with LLM-Based and Human-Aligned Metrics
Yousang Cho, Key-Sun Choi

TL;DR
This paper evaluates how well automatic metrics, both LLM-based and similarity-based, measure the quality of causal explanations in medical reports, and finds that GPT-Black best discriminates logically coherent, clinically valid causal narratives.
Contribution
It compares multiple evaluation metrics for causal explanations in medical reports and demonstrates the effectiveness of LLM-based metrics like GPT-Black and GPT-White.
Findings
GPT-Black best discriminates coherent causal narratives
GPT-White aligns well with expert assessments
Similarity metrics diverge from clinical reasoning quality
Abstract
This study investigates how accurately different evaluation metrics capture the quality of causal explanations in automatically generated diagnostic reports. We compare six metrics (BERTScore, Cosine Similarity, BioSentVec, GPT-White, GPT-Black, and expert qualitative assessment) across two input types: observation-based and multiple-choice-based report generation. Two weighting strategies are applied: one reflecting task-specific priorities, and the other assigning equal weights to all metrics. Our results show that GPT-Black demonstrates the strongest discriminative power in identifying logically coherent and clinically valid causal narratives. GPT-White also aligns well with expert evaluations, while similarity-based metrics diverge from clinical reasoning quality. These findings emphasize the impact of metric selection and weighting on evaluation outcomes, supporting the use of…
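The two weighting strategies mentioned in the abstract can be illustrated with a minimal sketch. It assumes each metric yields a score in [0, 1] for a report and that scores are combined as a normalized weighted mean. The abstract does not specify the aggregation formula, the weights, or any score values, so the function name `weighted_score`, the example scores, and the task-specific weights below are all hypothetical.

```python
from typing import Dict

# The six metrics named in the abstract.
METRICS = [
    "BERTScore",
    "Cosine Similarity",
    "BioSentVec",
    "GPT-White",
    "GPT-Black",
    "Expert",
]


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-metric scores into one value using a normalized weighted mean.

    Assumption: this aggregation rule is illustrative, not the paper's own formula.
    """
    total_weight = sum(weights[m] for m in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[m] * weights[m] for m in scores) / total_weight


# Hypothetical per-metric scores for one generated report (not from the paper).
report_scores = {
    "BERTScore": 0.9,
    "Cosine Similarity": 0.8,
    "BioSentVec": 0.7,
    "GPT-White": 0.6,
    "GPT-Black": 0.5,
    "Expert": 0.4,
}

# Strategy 1: equal weights for all metrics.
equal_weights = {m: 1.0 for m in METRICS}

# Strategy 2: hypothetical task-specific weights that emphasize the
# LLM-based and expert judgments over the similarity metrics.
task_weights = {
    "BERTScore": 1.0,
    "Cosine Similarity": 1.0,
    "BioSentVec": 1.0,
    "GPT-White": 2.0,
    "GPT-Black": 3.0,
    "Expert": 4.0,
}

equal_result = weighted_score(report_scores, equal_weights)
task_result = weighted_score(report_scores, task_weights)
print(f"equal weights: {equal_result:.4f}")
print(f"task-specific weights: {task_result:.4f}")
```

With these made-up numbers, the same report receives a different overall score under each strategy, which is the kind of sensitivity to weighting that the abstract points to.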
