FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence

Sebastian Antony Joseph; Lily Chen; Jan Trienes; Hannah Louisa Göke; Monika Coers; Wei Xu; Byron C Wallace; Junyi Jessy Li

arXiv:2402.11456·cs.CL·June 6, 2024·5 cites

TL;DR

FactPICO introduces a benchmark for evaluating the factual accuracy of plain language summaries of medical RCTs generated by LLMs, highlighting current challenges and the poor correlation of existing metrics with expert assessments.

Contribution

This paper presents FactPICO, a new factuality benchmark of LLM-generated plain language summaries of medical evidence, annotated and evaluated by experts. It also develops new LLM-based factuality metrics.

Findings

01 Existing metrics correlate poorly with expert judgments.

02 Summarization of medical evidence remains challenging for LLMs.

03 Balancing simplicity against factuality in medical summaries is difficult.
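
To make the first finding concrete, here is a minimal sketch of how agreement between an automatic metric and expert judgments could be measured with a rank correlation. It is not the authors' evaluation code: the function names (`rank`, `pearson`, `spearman`) and the toy scores are assumptions made purely for illustration, and the numbers are invented rather than taken from FactPICO.

```python
from statistics import mean


def rank(values):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


# Invented toy data, NOT from FactPICO: 1 = expert judged the summary factual.
expert_scores = [1, 0, 1, 1, 0, 0]
metric_scores = [0.62, 0.58, 0.41, 0.77, 0.69, 0.30]

rho = spearman(expert_scores, metric_scores)
print(f"Spearman correlation: {rho:.3f}")
```

A value near 1 would mean the metric ranks summaries much as the experts do, while a value near 0 would indicate the kind of weak agreement the finding describes.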

Abstract

Plain language summarization with LLMs can be useful for improving textual accessibility of technical content. But how factual are these summaries in a high-stakes domain like medicine? This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts describing randomized controlled trials (RCTs), which are the basis of evidence-based medicine and can directly inform patient treatment. FactPICO consists of 345 plain language summaries of RCT abstracts generated from three LLMs (i.e., GPT-4, Llama-2, and Alpaca), with fine-grained evaluation and natural language rationales from experts. We assess the factuality of critical elements of RCTs in those summaries: Populations, Interventions, Comparators, Outcomes (PICO), as well as the reported findings concerning these. We also evaluate the correctness of the extra information (e.g., explanations) added…
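
As a rough illustration of the fine-grained evaluation the abstract describes, one expert annotation could be represented as below. This is a hypothetical sketch: the class names, field names, and binary judgment are assumptions for illustration and do not reflect the actual FactPICO data format.

```python
from dataclasses import dataclass, field

# The four PICO elements named in the abstract.
PICO_ELEMENTS = ("population", "intervention", "comparator", "outcome")


@dataclass
class ElementJudgment:
    """Hypothetical record: one expert judgment about one PICO element."""
    element: str          # one of PICO_ELEMENTS
    is_factual: bool      # assumed binary label, for illustration only
    rationale: str = ""   # free-text explanation from the expert

    def __post_init__(self):
        if self.element not in PICO_ELEMENTS:
            raise ValueError(f"unknown PICO element: {self.element}")


@dataclass
class SummaryAnnotation:
    """Hypothetical record: all judgments for one generated summary."""
    summary_id: str
    model: str
    judgments: list = field(default_factory=list)

    def flagged_elements(self):
        """Return the elements the expert marked as not factual."""
        return [j.element for j in self.judgments if not j.is_factual]


# Invented example, not real FactPICO data.
annotation = SummaryAnnotation(
    summary_id="example-001",
    model="GPT-4",
    judgments=[
        ElementJudgment("population", True),
        ElementJudgment("outcome", False, "Summary overstates the effect."),
    ],
)
print(annotation.flagged_elements())
```

Keeping the rationale next to each judgment mirrors the abstract's pairing of fine-grained evaluation with natural language rationales.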

Code & Models

Repositories

lilywchen/factpico (Official)

Videos

FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence (Underline)

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Computational and Text Analysis Methods

Methods: Linear Layer · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax · Multi-Head Attention · Layer Normalization · Dropout · Residual Connection