Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors
Alex Chandler, Devesh Surve, Hui Su

TL;DR
DEEP is an end-to-end LLM framework that effectively detects factual errors in text summaries by ensembling diverse prompts and calibrating outputs, achieving state-of-the-art accuracy without model fine-tuning.
Contribution
The paper introduces DEEP, a novel prompt ensembling approach that improves factual error detection in summaries without requiring fine-tuning or complex thresholding.
Findings
Achieves state-of-the-art accuracy on multiple summarization benchmarks.
Outperforms prior models significantly without fine-tuning.
Provides a practical, threshold-free error detection method.
Abstract
Accurate text summarization is one of the most common and important tasks performed by Large Language Models, where the costs of human review for an entire document may be high, but the costs of errors in summarization may be even greater. We propose Detecting Errors through Ensembling Prompts (DEEP) - an end-to-end large language model framework for detecting factual errors in text summarization. Our framework uses a diverse set of LLM prompts to identify factual inconsistencies, treating their outputs as binary features, which are then fed into ensembling models. We then calibrate the ensembled models to produce empirically accurate probabilities that a text is factually consistent or free of hallucination. We demonstrate that prior models for detecting factual errors in summaries perform significantly worse without optimizing the thresholds on subsets of the evaluated dataset. Our…
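The pipeline the abstract describes (prompt outputs as binary features, an ensembling model, then calibration) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the summary above does not say which prompts, ensembling models, or calibration methods DEEP uses. The sketch substitutes a plain logistic regression for the ensembler and Platt scaling on a held-out split for the calibration step. The prompt votes are simulated with hypothetical per-prompt accuracies instead of real LLM calls.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit_score(x, w, b):
    """Linear score (log-odds) for one feature vector."""
    return b + sum(wj * xj for wj, xj in zip(w, x))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(logit_score(xi, w, b)) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# ASSUMPTION: hypothetical accuracies standing in for five different LLM prompts.
PROMPT_ACCURACY = [0.65, 0.70, 0.75, 0.80, 0.85]


def simulate(n):
    """Simulate binary prompt votes (1 = 'summary is consistent') with noisy labels."""
    X, y = [], []
    for _ in range(n):
        label = random.randint(0, 1)
        votes = [label if random.random() < acc else 1 - label
                 for acc in PROMPT_ACCURACY]
        X.append(votes)
        y.append(label)
    return X, y


random.seed(0)
X_train, y_train = simulate(200)  # used to fit the ensembler
X_cal, y_cal = simulate(200)      # held out, used only for calibration

# Step 1: ensemble the binary prompt outputs (stand-in: logistic regression).
w, b = fit_logistic(X_train, y_train)

# Step 2: calibrate with Platt scaling, i.e. a 1-D logistic fit on held-out scores.
cal_scores = [[logit_score(x, w, b)] for x in X_cal]
cal_w, cal_b = fit_logistic(cal_scores, y_cal, lr=0.1, epochs=1000)
slope = cal_w[0]


def predict_consistent_proba(votes):
    """Calibrated probability that a summary is factually consistent."""
    return sigmoid(slope * logit_score(votes, w, b) + cal_b)


p_all_yes = predict_consistent_proba([1, 1, 1, 1, 1])
p_all_no = predict_consistent_proba([0, 0, 0, 0, 0])
print(f"all prompts say consistent:   {p_all_yes:.3f}")
print(f"all prompts say inconsistent: {p_all_no:.3f}")
```

Because the final output is a probability, a caller can compare it to 0.5 directly, with no need to tune a dataset-specific threshold. That is the property the abstract contrasts with prior threshold-dependent detectors.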
Taxonomy
Topics: Software Engineering Research · Risk and Safety Analysis
Methods: Sparse Evolutionary Training
