When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
Bian Sun, Zhenjian Wang, Orvill de la Torre, and Zirui Wang

TL;DR
This study adapts Llama-2-7B to clinical dialogue using LoRA and compares traditional similarity metrics with GPT-4 judgments. The two evaluations disagree, which highlights the need for human validation in healthcare AI.
Contribution
The paper introduces a domain-specific adaptation of Llama-2-7B with LoRA for clinical dialogue and compares automated metrics with LLM-based assessments, emphasizing the validation challenges this comparison exposes.
Findings
LoRA improved lexical similarity scores on clinical transcripts.
GPT-4 evaluation disagreed with the traditional metrics and favored the baseline responses.
Automated metrics may not fully capture clinical utility, necessitating human validation.
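To make the notion of a lexical similarity score concrete, here is a minimal sketch of one such metric, a unigram-overlap F1. This summary does not say which metrics the paper used, so the function `unigram_f1` and its whitespace tokenization are assumptions chosen for illustration, not the authors' evaluation code.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a candidate response and a reference.

    Illustrative stand-in for a lexical similarity metric; tokenization is
    plain lowercased whitespace splitting (an assumption of this sketch).
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0

    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A score like this rewards a response for reusing the reference's wording. It carries no information about whether the response is clinically appropriate, which is one plausible reason a judge model can rank responses differently.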
Abstract
As Large Language Models (LLMs) are increasingly integrated into healthcare to address complex inquiries, ensuring their reliability remains a critical challenge. Recent studies have highlighted that generic LLMs often struggle in clinical contexts, occasionally producing misleading guidance. To mitigate these risks, this research focuses on the domain-specific adaptation of Llama-2-7B using the Low-Rank Adaptation (LoRA) technique. By injecting trainable low-rank matrices into the Transformer layers, we efficiently adapted the model using authentic patient-physician transcripts while preserving the foundational knowledge of the base model. Our objective was to enhance precision and contextual relevance in responding to medical queries by capturing the specialized nuances of clinical discourse. Due to the resource-intensive nature of large-scale human validation, the…
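The low-rank injection described in the abstract can be sketched for a single linear layer as follows. This is a minimal NumPy illustration of the general LoRA idea, not the authors' training setup: the class name `LoRALinear`, the rank, the scaling value, and the initialization constants are all assumptions made for the example.

```python
import numpy as np


class LoRALinear:
    """A frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Hypothetical helper for illustration only; rank and alpha are arbitrary.
    """

    def __init__(self, weight: np.ndarray, rank: int = 4, alpha: float = 8.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # pretrained weight, kept frozen
        # A starts as small random values and B as zeros, so the update
        # B @ A is zero and the layer initially matches the base model.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        base = x @ self.weight.T                        # frozen path
        update = self.scale * (x @ self.A.T) @ self.B.T  # low-rank path
        return base + update

    def trainable_params(self) -> int:
        # Only A and B would be updated during fine-tuning.
        return self.A.size + self.B.size
```

For a weight of shape (out_dim, in_dim), the adapter trains rank * (in_dim + out_dim) values instead of in_dim * out_dim, and the base weight is never modified, which is how the base model's knowledge is left intact.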
