From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports
Qiuli Wang, Jie Chen, Yongxu Liu, Xingpeng Zhang, Xiaoming Li, Wei Chen

TL;DR
This paper introduces a Multi-Dimensional Credibility Assessment (MDCA) framework to evaluate and improve the trustworthiness of Chinese LLM-generated liver MRI reports, and provides guidance on institution-specific prompt optimization.
Contribution
The paper presents a novel framework for assessing the credibility of LLM-generated radiology reports and offers guidance on prompt optimization in clinical radiology contexts.
Findings
The framework effectively evaluates the quality of reports generated by multiple LLMs.
Institution-specific prompt optimization improves report trustworthiness.
The study compares several advanced LLMs, including Kimi-K2-Instruct-0905, Qwen3-235B-A22B-Instruct-2507, and DeepSeek-V3, using the proposed assessment method.
Abstract
Large language models (LLMs) have demonstrated promising performance in generating diagnostic conclusions from imaging findings, thereby supporting radiology reporting, trainee education, and quality control. However, systematic guidance on how to optimize prompt design across different clinical contexts remains scarce. Moreover, a comprehensive and standardized framework for assessing the trustworthiness of LLM-generated radiology reports has yet to be established. This study aims to enhance the trustworthiness of LLM-generated liver MRI reports by introducing a Multi-Dimensional Credibility Assessment (MDCA) framework and providing guidance on institution-specific prompt optimization. The proposed framework is applied to evaluate and compare the performance of several advanced LLMs, including Kimi-K2-Instruct-0905, Qwen3-235B-A22B-Instruct-2507, DeepSeek-V3, and…
