OLAPH: Improving Factuality in Biomedical Long-form Question Answering

Minbyul Jeong; Hyeon Hwang; Chanwoong Yoon; Taewhoo Lee; Jaewoo Kang

arXiv:2405.12701 · cs.CL · October 16, 2024 · 2 cites

TL;DR

This paper introduces OLAPH, a novel framework for training large language models to improve factual accuracy in biomedical long-form question answering, using a new benchmark dataset and automatic evaluation methods.

Contribution

The paper presents OLAPH, a cost-effective, multifaceted evaluation framework that reduces hallucinations in LLMs and enhances factuality in biomedical long-form answers.

Findings

01

A 7B LLM trained with OLAPH matches medical experts' factuality in long answers.

02

OLAPH significantly improves LLM factuality on unseen evaluation metrics.

03

The MedLFQA dataset enables automatic evaluation of factual claims in biomedical QA.
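
As a rough illustration of how statement-level evaluation of this kind can be automated, the sketch below scores one answer against reference statements split into must-have and nice-to-have groups (the split the reviews describe). It assumes hallucination is the share of reference statements the answer contradicts and comprehensiveness is the share of must-have statements the answer entails. Those definitions, the `judge` callable, and the toy labels are illustrative assumptions, not the paper's released code; in practice `judge` would wrap an entailment model.

```python
from typing import Callable, Dict, List, Sequence

# Assumed judge interface: (answer, statement) -> "entail" | "contradict" | "neutral".
Judge = Callable[[str, str], str]


def hallucination_rate(answer: str, statements: Sequence[str], judge: Judge) -> float:
    """Fraction of reference statements that the answer contradicts."""
    if not statements:
        return 0.0
    contradicted = sum(1 for s in statements if judge(answer, s) == "contradict")
    return contradicted / len(statements)


def comprehensiveness(answer: str, must_have: Sequence[str], judge: Judge) -> float:
    """Fraction of must-have statements that the answer entails."""
    if not must_have:
        return 0.0
    entailed = sum(1 for s in must_have if judge(answer, s) == "entail")
    return entailed / len(must_have)


# Toy stand-in for an entailment model: a fixed lookup table (made-up labels).
TOY_LABELS: Dict[str, str] = {
    "claim A": "entail",
    "claim B": "neutral",
    "claim C": "contradict",
    "claim D": "entail",
}


def toy_judge(answer: str, statement: str) -> str:
    return TOY_LABELS.get(statement, "neutral")


answer = "placeholder model answer"
must_have: List[str] = ["claim A", "claim B"]
nice_to_have: List[str] = ["claim C", "claim D"]

print(hallucination_rate(answer, must_have + nice_to_have, toy_judge))  # 1 of 4 contradicted
print(comprehensiveness(answer, must_have, toy_judge))                  # 1 of 2 entailed
```

Keeping the judge as a parameter separates the metric arithmetic from the choice of entailment model, so the same functions work with any classifier that returns the three labels.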

Abstract

In the medical domain, numerous scenarios necessitate the long-form generation ability of large language models (LLMs). Specifically, when addressing patients' questions, it is essential that the model's response conveys factual claims, highlighting the need for an automated method to evaluate those claims. Thus, we introduce MedLFQA, a benchmark dataset reconstructed using long-form question-answering datasets related to the biomedical domain. We use MedLFQA to facilitate cost-effective automatic evaluation of factuality. We also propose OLAPH, a simple and novel framework that utilizes cost-effective and multifaceted automatic evaluation to construct a synthetic preference set and answers questions in our preferred manner. Our framework leads us to train LLMs step-by-step to reduce hallucinations and include crucial medical claims. We highlight that, even on evaluation metrics not…
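
The idea of turning automatic evaluation into a synthetic preference set can be sketched as follows. This is a minimal illustration, assuming that several candidate answers are sampled per question, that a single composite score summarizes the multifaceted evaluation, and that the highest- and lowest-scoring candidates form the preferred and dispreferred pair. The selection rule, the `PreferencePair` type, and the toy scores are assumptions for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass
class PreferencePair:
    question: str
    chosen: str    # candidate with the highest automatic score
    rejected: str  # candidate with the lowest automatic score


def build_preference_pair(
    question: str,
    candidates: Sequence[str],
    score: Callable[[str], float],
) -> Optional[PreferencePair]:
    """Pick the best and worst candidates under an automatic score.

    Returns None when fewer than two candidates exist or when all
    candidates tie, since no preference signal is available then.
    """
    if len(candidates) < 2:
        return None
    scored = [(score(c), c) for c in candidates]  # score each candidate once
    best_score, best = max(scored, key=lambda t: t[0])
    worst_score, worst = min(scored, key=lambda t: t[0])
    if best_score == worst_score:
        return None
    return PreferencePair(question=question, chosen=best, rejected=worst)


# Toy composite scores (made-up numbers standing in for the automatic metrics).
toy_scores: Dict[str, float] = {"a1": 0.2, "a2": 0.9, "a3": 0.5}

pair = build_preference_pair("q", ["a1", "a2", "a3"], lambda a: toy_scores[a])
print(pair)
```

Pairs built this way can feed any preference-optimization trainer that expects (prompt, chosen, rejected) triples.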

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

The study topic in this paper is important as the hallucination problem is critical when applying LLMs in the health domain. The manual evaluation and the dataset would be beneficial to the community.

Weaknesses

1. MedLFQA divides the reference answers into MUST HAVE and NICE TO HAVE statements and then calculates the hallucination and comprehensiveness metrics by comparing the generated text with the reference text. Although automatic hallucination detection and quantification are difficult, and automatic evaluation methods are worth exploring, it is not persuasive that the current setting can effectively serve as a hallucination metric. The problems are: a) this setting only evaluates a subset of hallucination; the LLM ge

Reviewer 02 · Rating 6 · Confidence 4

Strengths

+ The paper addresses an important problem in the medical domain, where factuality is crucial for patient safety and trust in medical AI systems.
+ The paper is clearly written with well-structured presentation, clear visualizations, and illustrative examples.
+ The introduction of MedLFQA as a unified benchmark for evaluating factuality in biomedical LFQA is a valuable contribution to the field.
+ The effectiveness of OLAPH is comprehensively validated with thorough analyses, comparisons with

Weaknesses

+ The novelty of the proposed OLAPH framework is limited, as it mostly follows the standard preference optimization process (SFT and DPO)
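
For readers unfamiliar with the DPO step the reviewer refers to, the standard DPO loss for a single preference pair can be written in a few lines. This is a generic sketch of the published DPO objective, not code from the OLAPH repository; the function name, the argument names, and the default `beta` are assumptions chosen for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each answer under the
    trainable policy and under the frozen reference model.
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) favors the chosen answer over the rejected one.
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When policy and reference agree, the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Shifting probability toward the chosen answer lowers the loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the chosen answer's likelihood relative to the reference and lowers the rejected answer's, which is what makes the choice of preference pairs matter.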

Reviewer 03 · Rating 8 · Confidence 3

Strengths

The paper presents an innovative approach to enhancing the factual accuracy of long-form biomedical question answering. It creatively combines current techniques in preference-based learning and factual consistency checks to improve medical domain answers, addressing key limitations in factuality and response quality in prior research. The methodological design is of good quality: the authors detail each step in OLAPH's alignment process, making it easy for others to build on to

Weaknesses

The framework introduced in this study uses GPT-4 to generate the must-have and nice-to-have statements in MedLFQA. My concern is that this may introduce biases or inaccuracies into the dataset. Although the researchers show that GPT-4-generated responses are close to human-curated answers, I did not find a critical analysis of where synthetic statements might diverge from medical experts. Also, I think the study framework is a good step towards medical LLMs, but it does not cover the framework'

Code & Models

Repositories

dmis-lab/olaph
PyTorch · Official

Datasets

dmis-lab/MedLFQA
dataset· 80 dl

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies

Methods: Sparse Evolutionary Training · ALIGN