Benchmarking Large Language Model Volatility
Boyang Yu

TL;DR
This paper investigates the variability of Large Language Model (LLM) outputs in financial sentiment analysis. It finds substantial volatility that propagates into investment decisions, and it evaluates two mitigation strategies: temperature tuning and ensembling.
Contribution
The paper provides the first comprehensive analysis of LLM output volatility in financial text understanding and evaluates practical methods for managing this uncertainty.
Findings
LLM outputs show substantial sentence-level sentiment variability.
Output volatility significantly impacts portfolio construction and returns.
Ensembling reduces volatility but increases computational costs.
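The findings on sentence-level variability and ensembling can be made concrete with a small sketch. It assumes each sentence is classified several times by the same LLM prompt, scores volatility as the share of runs that disagree with the majority label, and ensembles by majority vote. The function names, the disagreement metric, and the sample labels are hypothetical illustrations, not the paper's method or data.

```python
from collections import Counter

# Hypothetical sketch: each sentence is classified several times by the same
# LLM prompt, yielding one label per run. Labels and data are illustrative.


def modal_label(labels):
    """Majority-vote ensemble over repeated runs.

    Counter.most_common breaks ties by first-seen order, so the earliest
    label among tied candidates wins.
    """
    return Counter(labels).most_common(1)[0][0]


def disagreement_rate(labels):
    """Share of runs whose label differs from the majority-vote label.

    0.0 means every run agreed; larger values mean a more volatile sentence.
    """
    if not labels:
        raise ValueError("need at least one run")
    mode = modal_label(labels)
    return sum(label != mode for label in labels) / len(labels)


def summarize(runs_by_sentence):
    """Return (ensembled label per sentence, mean disagreement across sentences)."""
    ensembled = {sid: modal_label(runs) for sid, runs in runs_by_sentence.items()}
    total = sum(disagreement_rate(runs) for runs in runs_by_sentence.values())
    return ensembled, total / len(runs_by_sentence)


if __name__ == "__main__":
    # Made-up outputs from five repeated runs per sentence (not paper data).
    runs = {
        "s1": ["positive", "positive", "neutral", "positive", "positive"],
        "s2": ["negative", "neutral", "negative", "neutral", "negative"],
    }
    labels, volatility = summarize(runs)
    print(labels)  # {'s1': 'positive', 's2': 'negative'}
    print(volatility)
```

The computational trade-off noted in the findings is visible here: a k-run majority vote needs k model calls per sentence, so the ensemble's cost grows linearly with the number of runs.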
Abstract
The impact of non-deterministic outputs from Large Language Models (LLMs) has not been well examined for financial text understanding tasks. Through a case study on investing in the US equity market via news sentiment analysis, we uncover substantial variability in sentence-level sentiment classification results, underscoring the innate volatility of LLM outputs. These uncertainties cascade downstream, producing even larger variations in portfolio construction and returns. Lowering the temperature parameter in the language model decoder is a potential remedy, but it comes at the expense of reduced creativity. Similarly, ensembling multiple outputs mitigates the effect of volatile outputs but demands a notable computational investment. This work offers practitioners insights for navigating uncertainty when integrating LLMs into…
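The temperature trade-off in the abstract can be illustrated with a minimal sketch, assuming a simplified decoder whose final choice is among three hypothetical sentiment labels with made-up logits. Dividing the logits by a temperature T before the softmax concentrates probability on the top label as T shrinks. This lowers run-to-run variation but also lowers output diversity. The paper's actual model and decoding settings are not described here, and real LLMs emit labels as generated tokens rather than from a fixed three-way softmax.

```python
import math

# Illustrative sketch: how a decoder temperature T reshapes a distribution
# over three hypothetical sentiment labels. Logits are made up.


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive (T -> 0 approaches greedy argmax)")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means more spread-out (more varied) samples."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    labels = ["positive", "neutral", "negative"]
    logits = [2.0, 1.0, 0.5]  # hypothetical scores, not from the paper
    for t in (0.2, 1.0, 2.0):
        probs = softmax_with_temperature(logits, t)
        print(t, dict(zip(labels, (round(p, 3) for p in probs))), round(entropy(probs), 3))
```

Lower temperatures yield lower entropy, so repeated samples agree more often. This is the mechanism behind the abstract's trade-off between less volatile outputs and stifled creativity.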
Taxonomy
Topics: Topic Modeling · Stock Market Forecasting Methods · Natural Language Processing Techniques
