Towards Dataset-scale and Feature-oriented Evaluation of Text Summarization in Large Language Model Prompts
Sam Yu-Te Lee, Aryaman Bahukhandi, Dongyu Liu, and Kwan-Liu Ma

TL;DR
This paper proposes a feature-oriented evaluation workflow and a visual analytics system, Awesum, to systematically assess and refine text summarization prompts for large language models, emphasizing interpretability and user-friendliness.
Contribution
It introduces a novel feature-based approach to prompt evaluation and the Awesum system, which together enable non-technical users to optimize prompts at dataset scale.
Findings
The system helps non-technical users evaluate prompts more easily.
Feature-oriented evaluation can generalize beyond text summarization.
The workflow improves prompt refinement efficiency.
Abstract
Recent advancements in Large Language Models (LLMs) and Prompt Engineering have made chatbot customization more accessible, significantly reducing barriers to tasks that previously required programming skills. However, prompt evaluation, especially at the dataset scale, remains complex due to the need to assess prompts across thousands of test instances within a dataset. Our study, based on a comprehensive literature review and pilot study, summarized five critical challenges in prompt evaluation. In response, we introduce a feature-oriented workflow for systematic prompt evaluation. In the context of text summarization, our workflow advocates evaluation with summary characteristics (feature metrics) such as complexity, formality, or naturalness, instead of using traditional quality metrics like ROUGE. This design choice enables a more user-friendly evaluation of prompts, as it guides…
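To make the idea of feature metrics concrete, here is a minimal sketch of scoring the summaries produced by a prompt on descriptive characteristics and averaging those scores over a dataset. The abstract does not say how Awesum computes complexity or formality, so the two proxies below (words per sentence, and the share of words that are not contractions) are illustrative stand-ins, and every function name is hypothetical.

```python
import re
from statistics import mean


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Extract alphabetic tokens, keeping apostrophes so "don't" stays one token."""
    return re.findall(r"[A-Za-z']+", text)


def complexity_proxy(summary):
    """ASSUMPTION: average words per sentence as a stand-in for complexity."""
    sentences = split_sentences(summary)
    if not sentences:
        return 0.0
    return len(tokenize(summary)) / len(sentences)


def formality_proxy(summary):
    """ASSUMPTION: share of tokens that are not common contractions."""
    tokens = tokenize(summary)
    if not tokens:
        return 0.0
    endings = ("n't", "'re", "'ve", "'ll", "'m")
    informal = sum(1 for t in tokens if t.lower().endswith(endings))
    return 1.0 - informal / len(tokens)


def feature_profile(summaries):
    """Average each feature over all summaries produced by one prompt.

    Raises statistics.StatisticsError if `summaries` is empty.
    """
    return {
        "complexity": mean(complexity_proxy(s) for s in summaries),
        "formality": mean(formality_proxy(s) for s in summaries),
    }


# Usage: compare two hypothetical prompts by the feature profiles of their outputs.
outputs_prompt_a = ["The committee approved the budget. It takes effect in June."]
outputs_prompt_b = ["They don't like it. We're done."]
profile_a = feature_profile(outputs_prompt_a)
profile_b = feature_profile(outputs_prompt_b)
```

Comparing profiles like these across prompts shows how the outputs differ in character rather than producing a single quality score, which is the contrast the abstract draws with metrics such as ROUGE.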
