Large Language Models and Stock Investing: Is the Human Factor Required?
Ricardo Crisostomo, Diana Mykhalyuk

TL;DR
This study assesses the effectiveness of large language models in stock prediction, highlighting their limitations and the importance of human oversight and grounding in official data for improved accuracy.
Contribution
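It systematically evaluates four LLMs (ChatGPT, Gemini, DeepSeek, and Perplexity) across three prompting strategies (a naive query, a structured approach, and chain-of-thought reasoning), showing both their potential and their failure modes in financial forecasting and the role of human supervision.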
Findings
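LLM-generated recommendations are hindered by recurring reasoning failures: financial misconceptions, carryover errors, and reliance on outdated or hallucinated information.
When appropriately guided and supervised, LLMs can outperform the market, but doing so requires substantial human oversight.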
Grounding recommendations in official filings improves forecast accuracy.
Abstract
This paper investigates whether large language models (LLMs) can generate reliable stock market predictions. We evaluate four state-of-the-art models - ChatGPT, Gemini, DeepSeek, and Perplexity - across three prompting strategies: a naive query, a structured approach, and chain-of-thought reasoning. Our results show that LLM-generated recommendations are hindered by recurring reasoning failures, including financial misconceptions, carryover errors, and reliance on outdated or hallucinated information. When appropriately guided and supervised, LLMs demonstrate the capacity to outperform the market, but realizing LLMs' full potential requires substantial human oversight. We also find that grounding stock recommendations in official regulatory filings increases their forecasting accuracy. Overall, our findings underscore the need for robust safeguards and validation when deploying LLMs in…
