An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques
Walid Mohamed Aly, Taysir Hassan A. Soliman, Amr Mohamed AbdelAziz

TL;DR
This paper systematically evaluates six large language models on text summarization across four datasets (news, dialog, and scientific articles) using zero-shot and in-context prompting, reports how performance varies, and proposes a chunking strategy for long documents.
Contribution
The paper provides a comprehensive analysis of LLMs on summarization tasks with prompt engineering, introduces a chunking method for long documents, and analyzes the trade-off between summary quality and inference time.
Findings
LLMs perform well on news and dialog summarization.
Chunking improves long-document summarization accuracy.
Performance varies with model size, dataset, and prompt design.
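The summary above does not describe how the paper's chunking method works, so the following is only a generic sketch of the idea: split a long document into overlapping word windows, summarize each window, and join the partial summaries. The names `split_into_chunks`, `summarize_long_document`, and the `summarize` callable, as well as the `max_words` and `overlap` defaults, are assumptions made for illustration and are not taken from the paper.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into word windows of at most max_words, overlapping by `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def summarize_long_document(
    text: str,
    summarize: Callable[[str], str],  # hypothetical: wraps a call to some LLM
    max_words: int = 200,
    overlap: int = 20,
) -> str:
    """Summarize each chunk independently and join the partial summaries."""
    partial_summaries = [summarize(chunk) for chunk in split_into_chunks(text, max_words, overlap)]
    return " ".join(partial_summaries)
```

In practice the joined partial summaries could be passed through the model once more to produce a single shorter summary; whether the paper does this is not stated in the summary above.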
Abstract
Large Language Models (LLMs) continue to advance natural language processing with their ability to generate human-like text across a range of tasks. Despite the remarkable success of LLMs in Natural Language Processing (NLP), their performance in text summarization across various domains and datasets has not been comprehensively evaluated. At the same time, the ability to summarize text effectively without relying on extensive training data has become a crucial bottleneck. To address these issues, we present a systematic evaluation of six LLMs across four datasets: CNN/Daily Mail and NewsRoom (news), SAMSum (dialog), and ArXiv (scientific). By leveraging prompt engineering techniques including zero-shot and in-context learning, our study evaluates the performance using the ROUGE and BERTScore metrics. In addition, a detailed analysis of inference times is conducted to better understand…
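To make the prompting and scoring setup concrete, here is a minimal sketch of two pieces mentioned in the abstract: building a zero-shot versus an in-context (few-shot) summarization prompt, and scoring a summary with a simplified ROUGE-1 F1. The prompt wording, the function names `build_prompt` and `rouge1_f1`, and the whitespace tokenization are assumptions for illustration; the paper's actual prompts and its ROUGE implementation (which typically adds stemming and finer tokenization) may differ.

```python
from collections import Counter
from typing import Sequence, Tuple


def build_prompt(document: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Return a zero-shot prompt, or an in-context prompt when examples are given.

    `examples` holds (text, reference_summary) demonstration pairs.
    The instruction wording is hypothetical.
    """
    parts = ["Summarize the following text in a few sentences."]
    for example_text, example_summary in examples:
        parts.append(f"Text: {example_text}\nSummary: {example_summary}")
    parts.append(f"Text: {document}\nSummary:")
    return "\n\n".join(parts)


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased whitespace tokens."""
    candidate_counts = Counter(candidate.lower().split())
    reference_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((candidate_counts & reference_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(candidate_counts.values())
    recall = overlap / sum(reference_counts.values())
    return 2 * precision * recall / (precision + recall)
```

For example, scoring the candidate "the cat sat" against the reference "the cat sat on the mat" gives precision 1.0 and recall 0.5, so the F1 is 2/3. BERTScore, the other metric named in the abstract, compares contextual embeddings rather than surface tokens and is not sketched here.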
