OLAPH: Improving Factuality in Biomedical Long-form Question Answering
Minbyul Jeong, Hyeon Hwang, Chanwoong Yoon, Taewhoo Lee, Jaewoo Kang

TL;DR
This paper introduces OLAPH, a novel framework for training large language models to improve factual accuracy in biomedical long-form question answering, using a new benchmark dataset and automatic evaluation methods.
Contribution
The paper presents OLAPH, a cost-effective, multifaceted evaluation framework that reduces hallucinations in LLMs and enhances factuality in biomedical long-form answers.
Findings
A 7B LLM trained with OLAPH matches medical experts' factuality in long answers.
OLAPH significantly improves LLM factuality on unseen evaluation metrics.
The MedLFQA dataset enables automatic evaluation of factual claims in biomedical QA.
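The kind of automatic evaluation MedLFQA is meant to enable can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes hallucination is the share of all reference statements (must-have plus nice-to-have) that the response contradicts, and comprehensiveness is the share of must-have statements that the response entails. The entailment judge is passed in as a plain function. `make_lookup_judge` and the placeholder statements are hypothetical stand-ins for a real NLI model and real data.

```python
from typing import Callable, Dict, Sequence

# An NLI judge maps (premise, hypothesis) to "entail", "contradict", or "neutral".
NliJudge = Callable[[str, str], str]


def hallucination(response: str, must_have: Sequence[str],
                  nice_to_have: Sequence[str], nli: NliJudge) -> float:
    """Percentage of all reference statements that the response contradicts."""
    statements = list(must_have) + list(nice_to_have)
    if not statements:
        return 0.0
    contradicted = sum(nli(response, s) == "contradict" for s in statements)
    return 100.0 * contradicted / len(statements)


def comprehensiveness(response: str, must_have: Sequence[str],
                      nli: NliJudge) -> float:
    """Percentage of must-have statements that the response entails."""
    if not must_have:
        return 0.0
    entailed = sum(nli(response, s) == "entail" for s in must_have)
    return 100.0 * entailed / len(must_have)


def make_lookup_judge(labels: Dict[str, str]) -> NliJudge:
    """Hypothetical toy judge: returns a hand-written label per statement.

    A real setup would call an NLI model here instead.
    """
    return lambda premise, hypothesis: labels.get(hypothesis, "neutral")


# Placeholder statements and labels, invented purely for the demo.
must_have = ["statement A", "statement B"]
nice_to_have = ["statement C", "statement D"]
judge = make_lookup_judge({
    "statement A": "entail",
    "statement B": "contradict",
    "statement D": "entail",
})

response = "some generated long-form answer"
print(hallucination(response, must_have, nice_to_have, judge))  # 25.0
print(comprehensiveness(response, must_have, judge))            # 50.0
```

Passing the judge as an argument keeps the metric arithmetic separate from the choice of entailment model, so the same two functions work with any classifier that returns the three labels.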
Abstract
In the medical domain, numerous scenarios necessitate the long-form generation ability of large language models (LLMs). Specifically, when addressing patients' questions, it is essential that the model's response conveys factual claims, highlighting the need for an automated method to evaluate those claims. Thus, we introduce MedLFQA, a benchmark dataset reconstructed using long-form question-answering datasets related to the biomedical domain. We use MedLFQA to facilitate cost-effective automatic evaluation of factuality. We also propose OLAPH, a simple and novel framework that utilizes cost-effective and multifaceted automatic evaluation to construct a synthetic preference set and answers questions in our preferred manner. Our framework leads us to train LLMs step-by-step to reduce hallucinations and include crucial medical claims. We highlight that, even on evaluation metrics not…
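One plausible way to turn automatic scores into a synthetic preference set is sketched below. It is an illustration under stated assumptions, not the procedure from the paper. The composite score (comprehensiveness minus hallucination), the `margin` parameter, the `Scored` class, and the `prompt`/`chosen`/`rejected` keys are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Scored:
    """A sampled answer with its automatic evaluation scores (0 to 100)."""
    text: str
    comprehensiveness: float
    hallucination: float

    @property
    def score(self) -> float:
        # Assumed composite: reward coverage, penalize contradictions.
        return self.comprehensiveness - self.hallucination


def build_pair(question: str, candidates: List[Scored],
               margin: float = 0.0) -> Optional[Dict[str, str]]:
    """Return a preference pair from the best and worst candidate.

    Returns None when there are fewer than two candidates or when the
    score gap does not exceed `margin`, since such pairs carry little signal.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best.score - worst.score <= margin:
        return None
    return {"prompt": question, "chosen": best.text, "rejected": worst.text}


# Invented scores for three sampled answers to one question.
samples = [
    Scored("answer 1", comprehensiveness=80.0, hallucination=10.0),  # score 70
    Scored("answer 2", comprehensiveness=50.0, hallucination=30.0),  # score 20
    Scored("answer 3", comprehensiveness=90.0, hallucination=40.0),  # score 50
]
pair = build_pair("example question", samples)
print(pair["chosen"], "|", pair["rejected"])  # answer 1 | answer 2
```

Pairs built this way could feed a preference-optimization step, and repeating the sample, score, and pair cycle on the updated model is one reading of the step-by-step training the abstract describes.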
Peer Reviews
Decision: Submitted to ICLR 2025
The study topic in this paper is important as the hallucination problem is critical when applying LLMs in the health domain. The manual evaluation and the dataset would be beneficial to the community.
1. MedLFQA splits the reference answers into MUST HAVE and NICE TO HAVE statements and then calculates the hallucination and comprehensiveness metrics by comparing the generated text with the reference text. Although automatic hallucination detection and quantification are difficult, and automatic evaluation methods are worth exploring, it is not persuasive that the current setting can effectively serve as a hallucination metric. The problems are: a) this setting only evaluates a subset of hallucinations; the LLM ge
+ The paper addresses an important problem in the medical domain, where factuality is crucial for patient safety and trust in medical AI systems.
+ The paper is clearly written with well-structured presentation, clear visualizations, and illustrative examples.
+ The introduction of MedLFQA as a unified benchmark for evaluating factuality in biomedical LFQA is a valuable contribution to the field.
+ The effectiveness of OLAPH is comprehensively validated with thorough analyses, comparisons with
+ The novelty of the proposed OLAPH framework is limited, as it mostly follows the standard preference optimization process (SFT and DPO)
The paper shows an innovative approach to enhancing the factual accuracy of long-form biomedical question answering. This work also creatively combines current techniques in preference-based learning and factual consistency checks to improve medical domain answers, addressing key limitations in factuality and response quality in prior research. The methodological design is of good quality. The authors have detailed each step in OLAPH’s alignment process, making it easy for others to build on to
The framework introduced in this study uses GPT-4 to generate must-have and nice-to-have statements in MedLFQA. My concern is that this may introduce biases or inaccuracies into the dataset. Although the researchers show that GPT-4-generated responses are close to human-curated answers, I did not find a critical analysis of where synthetic statements might diverge from medical experts. Also, I think the study framework is a good step towards medical LLMs, but it does not cover the framework'
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies
Methods: Sparse Evolutionary Training · ALIGN
