When Reasoning Fails: Evaluating 'Thinking' LLMs for Stock Prediction
Rakeshkumar H Sodha

TL;DR
This study evaluates whether 'thinking' large language models improve stock prediction in complex, noisy financial environments. It finds that, in the tested setting, they underperform both direct LLMs and classical methods.
Contribution
It provides empirical evidence that reasoning-augmented LLMs do not outperform direct LLMs or classical models in stock prediction tasks with high complexity and noise.
Findings
TLLMs' ranking quality deteriorates as the universe size grows under a fixed reasoning budget.
TLLMs exhibit higher variance, which calls for calibration.
Portfolio performance after transaction costs does not favor TLLMs.
Abstract
Problem. "Thinking" LLMs (TLLMs) expose explicit or hidden reasoning traces and are widely believed to generalize better on complex tasks than direct LLMs. Whether this promise carries to noisy, heavy-tailed and regime-switching financial data remains unclear.
Approach. Using Indian equities (NIFTY constituents), we run a rolling 48m/1m walk-forward evaluation at horizon k = 1 day and dial cross-sectional complexity via the universe size U in {5, 11, 21, 36} while keeping the reasoning budget fixed (B = 512 tokens) for the TLLM. We compare a direct LLM (gpt-4o-mini), a TLLM (gpt-5), and classical learners (ridge, random forest) on cross-sectional ranking loss 1 - IC, MSE, and long/short backtests with realistic costs. Statistical confidence is measured with Diebold-Mariano, Pesaran-Timmermann, and SPA tests.
Main findings. (i) As U grows under a fixed budget B, the TLLM's ranking…
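The evaluation protocol described in the abstract can be illustrated with a minimal sketch. It is not the paper's code. It assumes that IC means the Spearman rank correlation between predicted scores and realized returns within one cross-section (the abstract does not say which correlation is used), and that the 48m/1m walk-forward scheme slides a 48-month training window forward one month at a time. The function names `rank_ic`, `ranking_loss`, and `walk_forward_windows` are hypothetical.

```python
import numpy as np


def rank_ic(pred, realized):
    """Rank correlation between predicted scores and realized returns.

    Assumption: IC is taken to be the Spearman rank correlation; the
    source does not state which correlation it uses.
    """
    pred = np.asarray(pred, dtype=float)
    realized = np.asarray(realized, dtype=float)
    # A double argsort yields ordinal ranks. Ties are broken by position,
    # which is adequate for continuous-valued scores and returns.
    pred_ranks = np.argsort(np.argsort(pred)).astype(float)
    real_ranks = np.argsort(np.argsort(realized)).astype(float)
    return float(np.corrcoef(pred_ranks, real_ranks)[0, 1])


def ranking_loss(pred, realized):
    """Cross-sectional ranking loss 1 - IC for one date."""
    return 1.0 - rank_ic(pred, realized)


def walk_forward_windows(n_months, train=48, test=1):
    """Yield (train_start, train_end, test_end) month indices.

    Ranges are half-open: train on [train_start, train_end) and test on
    [train_end, test_end). The window advances by `test` months per step
    (an assumed reading of the "48m/1m" notation).
    """
    start = 0
    while start + train + test <= n_months:
        yield start, start + train, start + train + test
        start += test


if __name__ == "__main__":
    scores = [0.3, -0.1, 0.8, 0.2, -0.5]    # illustrative values only
    returns = [0.01, -0.02, 0.03, 0.00, -0.01]
    print(ranking_loss(scores, returns))
    print(list(walk_forward_windows(50)))
```

Under this definition, a perfect ordering of the universe gives a loss of 0 and a fully reversed ordering gives a loss of 2, so values near 1 correspond to rankings that carry no information about realized returns.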
Taxonomy
Topics: Stock Market Forecasting Methods · Explainable Artificial Intelligence (XAI) · Financial Distress and Bankruptcy Prediction
