Humans vs Large Language Models: Judgmental Forecasting in an Era of Advanced AI
Mahdi Abolghasemi, Odkhishig Ganbold, Kristian Rotaru

TL;DR
This study compares human experts and large language models in retail sales forecasting, revealing that LLMs do not consistently outperform humans and both are affected by promotions and external factors.
Contribution
It provides a controlled experimental comparison of human and LLM forecasting accuracy in retail, highlighting limitations and influencing factors.
Findings
LLMs do not consistently outperform humans in forecasting accuracy.
Promotional periods increase forecasting errors for both humans and LLMs.
Advanced statistical models do not always improve forecasting performance.
Abstract
This study investigates the forecasting accuracy of human experts versus Large Language Models (LLMs) in the retail sector, particularly during standard and promotional sales periods. Utilizing a controlled experimental setup with 123 human forecasters and five LLMs (ChatGPT4, ChatGPT3.5, Bard, Bing, and Llama2), we evaluated forecasting precision through Mean Absolute Percentage Error. Our analysis centered on the effect of the following factors on forecasters' performance: the supporting statistical model (baseline and advanced), whether the product was on promotion, and the nature of external impact. The findings indicate that LLMs do not consistently outperform humans in forecasting accuracy and that advanced statistical forecasting models do not uniformly enhance the performance of either human forecasters or LLMs. Both human and LLM forecasters exhibited increased…
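The abstract names Mean Absolute Percentage Error (MAPE) as the accuracy measure. As a rough illustration of how that metric is usually computed, here is a minimal Python sketch. The function name `mape` and the sample numbers are hypothetical and are not taken from the paper; the study's exact computation (for example, how it treats zero sales) is not stated in the text above.

```python
from typing import Sequence


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Return the Mean Absolute Percentage Error, expressed in percent.

    Standard textbook definition: the mean of |actual - forecast| / |actual|,
    multiplied by 100. This is an illustrative helper, not the authors' code.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    if any(a == 0 for a in actual):
        # MAPE divides by the actual value, so it is undefined at zero.
        raise ValueError("MAPE is undefined when an actual value is zero")

    total = sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Made-up sales figures, for illustration only.
example_actual = [100.0, 200.0, 400.0]
example_forecast = [110.0, 180.0, 400.0]
example_error = mape(example_actual, example_forecast)
```

Because each error is scaled by the actual value, MAPE lets forecasts for products with very different sales volumes be compared on one scale, though it becomes unstable when actual sales are near zero.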
Taxonomy
Topics: Forecasting Techniques and Applications
