TL;DR
This paper introduces Multi2, a scalable framework for multi-document summarization that applies test-time scaling through prompt ensembles. It also proposes two LLM-based evaluation metrics, reports improved summary quality, and examines the practical limits of scaling.
Contribution
It presents a new test-time scaling approach for multi-document summarization (MDS) based on prompt ensembles, and introduces two metrics for evaluating the resulting summaries.
Findings
Prompt ensemble methods improve summary quality.
The two new metrics, CAP and LLM-ACU, effectively evaluate summary consistency.
The study identifies practical scaling boundaries for multi-document summarization.
Abstract
Recent advances in test-time scaling have shown promising results in improving Large Language Model (LLM) performance through strategic computation allocation during inference. While this approach has demonstrated strong improvements in logical and mathematical reasoning tasks, its application to natural language generation (NLG), particularly summarization, remains unexplored. Multi-Document Summarization (MDS), a fundamental task in NLG, presents unique challenges by requiring models to extract and synthesize essential information across multiple lengthy documents. Unlike reasoning tasks, MDS demands a more nuanced approach to prompt design and ensemble methods, as no single "best" prompt can satisfy diverse summarization requirements. We propose a novel framework leveraging test-time scaling for MDS. Our approach employs prompt ensemble techniques to generate multiple candidate…
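As a rough illustration of the prompt-ensemble idea in the abstract, the sketch below generates one candidate summary per prompt and then picks a single output. The abstract excerpt is cut off before it describes how candidates are combined, so the selection rule here (choose the candidate with the highest average word overlap with the others) is an assumption made for illustration, not the paper's method. The prompt texts, the `llm` callable, and all function names are hypothetical.

```python
from typing import Callable, List

# Hypothetical prompt variants; the paper's actual prompts are not shown in the excerpt.
PROMPTS = [
    "Summarize the key facts shared across these documents:\n{docs}",
    "Write a concise summary of these documents for a general reader:\n{docs}",
    "List the main events in these documents, then summarize them:\n{docs}",
]


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def generate_candidates(
    docs: List[str], prompts: List[str], llm: Callable[[str], str]
) -> List[str]:
    """Call the model once per prompt, producing one candidate summary each."""
    joined = "\n\n".join(docs)
    return [llm(p.format(docs=joined)) for p in prompts]


def select_by_consensus(candidates: List[str]) -> str:
    """Assumed aggregation rule: return the candidate most similar to the rest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def avg_overlap(i: int) -> float:
        scores = [
            jaccard(candidates[i], c) for j, c in enumerate(candidates) if j != i
        ]
        return sum(scores) / len(scores)

    best = max(range(len(candidates)), key=avg_overlap)
    return candidates[best]


def summarize(docs: List[str], llm: Callable[[str], str]) -> str:
    """Prompt-ensemble summarization: generate candidates, then select one."""
    return select_by_consensus(generate_candidates(docs, PROMPTS, llm))
```

Here `llm` is any function that maps a prompt string to a completion, so a real model client can be swapped in without changing the rest of the sketch. Adding more prompts to `PROMPTS` is the simplest way to spend more test-time compute in this setup.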