FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence
Sebastian Antony Joseph, Lily Chen, Jan Trienes, Hannah Louisa Göke, Monika Coers, Wei Xu, Byron C Wallace, Junyi Jessy Li

TL;DR
FactPICO is a benchmark for evaluating the factual accuracy of LLM-generated plain language summaries of medical RCT abstracts. It highlights the remaining challenges of this task and shows that existing metrics correlate poorly with expert assessments.
Contribution
This paper presents FactPICO, a new factuality benchmark of expert-annotated, LLM-generated plain language summaries of medical evidence. The benchmark is used to evaluate existing factuality metrics alongside newly developed LLM-based ones.
Findings
Existing factuality metrics correlate poorly with expert judgments.
Plain language summarization of medical evidence remains challenging for LLMs.
Balancing simplicity against factuality in medical summaries is difficult.
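The first finding is about how closely automatic metric scores track expert judgments across individual summaries. The minimal sketch below shows one way such an agreement check could be computed. Spearman rank correlation is used only as a common choice; this page does not say which statistic the authors report. The function names and the example scores are hypothetical and are not taken from FactPICO.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical, made-up scores for five summaries (not FactPICO data).
    metric_scores = [0.91, 0.85, 0.40, 0.77, 0.63]  # automatic metric output
    expert_scores = [2, 4, 1, 3, 4]                 # expert factuality rating
    print(f"Spearman correlation: {spearman(metric_scores, expert_scores):.3f}")
```

A value near 1 would mean the metric ranks summaries much as the experts do, while a value near 0 would reflect the kind of poor agreement described in the finding.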
Abstract
Plain language summarization with LLMs can be useful for improving textual accessibility of technical content. But how factual are these summaries in a high-stakes domain like medicine? This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts describing randomized controlled trials (RCTs), which are the basis of evidence-based medicine and can directly inform patient treatment. FactPICO consists of 345 plain language summaries of RCT abstracts generated from three LLMs (i.e., GPT-4, Llama-2, and Alpaca), with fine-grained evaluation and natural language rationales from experts. We assess the factuality of critical elements of RCTs in those summaries: Populations, Interventions, Comparators, Outcomes (PICO), as well as the reported findings concerning these. We also evaluate the correctness of the extra information (e.g., explanations) added…
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Computational and Text Analysis Methods
Methods: Linear Layer · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax · Multi-Head Attention · Layer Normalization · Dropout · Residual Connection
