Decision-Oriented Text Evaluation
Yu-Shiang Huang, Chuan-Ju Wang, Chung-Chi Chen

TL;DR
This paper introduces a decision-oriented framework that evaluates generated text by its effect on decision-making outcomes in financial contexts. The results expose the limitations of traditional intrinsic metrics and point to the potential of collaborative human-LLM teams.
Contribution
It proposes a novel decision-based evaluation method for NLG that measures decision outcomes directly instead of relying on traditional intrinsic metrics, and it examines collaborative human-LLM decision-making.
Findings
Neither humans nor LLM agents consistently outperform random baselines when relying on summaries alone.
Rich analytical commentaries improve decision-making performance.
Collaborative human-LLM teams outperform both individual LLM agents and individual humans.
Abstract
Natural language generation (NLG) is increasingly deployed in high-stakes domains, yet common intrinsic evaluation methods, such as n-gram overlap or sentence plausibility, weakly correlate with actual decision-making efficacy. We propose a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes. Using market digest texts--including objective morning summaries and subjective closing-bell analyses--as test cases, we assess decision quality based on the financial performance of trades executed by human investors and autonomous LLM agents informed exclusively by these texts. Our findings reveal that neither humans nor LLM agents consistently surpass random performance when relying solely on summaries. However, richer analytical commentaries enable collaborative human-LLM teams to outperform…
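The idea of scoring a text by the trades it leads to, and comparing that score with random performance, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the decision encoding (+1 long, -1 short, 0 no trade), the use of mean per-period return as the score, the coin-flip baseline, and the function names `decision_return`, `random_baseline`, and `excess_over_random`.

```python
import random
from statistics import mean


def decision_return(decisions, asset_returns):
    """Mean per-period return earned by a sequence of trading decisions.

    Hypothetical encoding: +1 = long, -1 = short, 0 = no trade.
    """
    if len(decisions) != len(asset_returns):
        raise ValueError("decisions and asset_returns must have equal length")
    return mean(d * r for d, r in zip(decisions, asset_returns))


def random_baseline(asset_returns, trials=1000, seed=0):
    """Average score of a trader who flips a fair coin each period.

    This coin-flip baseline is an assumption, not the paper's definition.
    """
    rng = random.Random(seed)
    scores = [
        decision_return([rng.choice((-1, 1)) for _ in asset_returns], asset_returns)
        for _ in range(trials)
    ]
    return mean(scores)


def excess_over_random(decisions, asset_returns, trials=1000, seed=0):
    """How far a decision-maker's score sits above the random baseline."""
    baseline = random_baseline(asset_returns, trials, seed)
    return decision_return(decisions, asset_returns) - baseline
```

Under this sketch, a text would count as useful only if the decisions made after reading it yield a positive `excess_over_random`; a real study would also need a significance test and realistic trading costs.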
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Sentiment Analysis and Opinion Mining
