When Reasoning Fails: Evaluating 'Thinking' LLMs for Stock Prediction

Rakeshkumar H Sodha

arXiv:2511.08608 · q-fin.ST · November 13, 2025

Open Access

TL;DR

This study tests whether 'thinking' large language models (TLLMs) improve stock prediction on complex, noisy financial data. It finds that, in the setting evaluated, they underperform both direct LLMs and classical methods.

Contribution

It provides empirical evidence that reasoning-augmented LLMs do not outperform direct LLMs or classical models in stock prediction tasks with high complexity and noise.

Findings

01. TLLMs' ranking quality deteriorates with increased universe size.

02. TLLMs exhibit higher variance requiring calibration.

03. Portfolio performance under costs does not favor TLLMs.

Abstract

Problem. "Thinking" LLMs (TLLMs) expose explicit or hidden reasoning traces and are widely believed to generalize better on complex tasks than direct LLMs. Whether this promise carries over to noisy, heavy-tailed, and regime-switching financial data remains unclear.

Approach. Using Indian equities (NIFTY constituents), we run a rolling 48m/1m walk-forward evaluation at horizon k = 1 day and vary cross-sectional complexity through the universe size U in {5, 11, 21, 36}, while keeping the TLLM's reasoning budget fixed (B = 512 tokens). We compare a direct LLM (gpt-4o-mini), a TLLM (gpt-5), and classical learners (ridge, random forest) on cross-sectional ranking loss 1 - IC, MSE, and long/short backtests with realistic costs. Statistical confidence is assessed with Diebold-Mariano, Pesaran-Timmermann, and SPA tests.

Main findings. (i) As U grows under a fixed budget B, the TLLM's ranking…
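The evaluation setup described in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: IC is taken to be the Spearman rank correlation between predicted scores and realized returns across the universe on a given date, and "48m/1m" is read as a 48-month training window followed by a 1-month test window, rolled forward one month at a time. The function names (`rank_ic`, `ranking_loss`, `walk_forward_windows`) are hypothetical and not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_ic(predicted, realized):
    """Assumed IC: Spearman rank correlation across one date's universe."""
    rp, rr = average_ranks(predicted), average_ranks(realized)
    mp, mr = mean(rp), mean(rr)
    cov = sum((a - mp) * (b - mr) for a, b in zip(rp, rr))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_r = sum((b - mr) ** 2 for b in rr)
    if var_p == 0 or var_r == 0:
        return 0.0  # assumption: a constant ranking carries no information
    return cov / (var_p * var_r) ** 0.5


def ranking_loss(predicted, realized):
    """Cross-sectional ranking loss 1 - IC (0 is a perfect ranking, 2 is inverted)."""
    return 1.0 - rank_ic(predicted, realized)


def walk_forward_windows(n_months, train=48, test=1):
    """Yield (train_range, test_range) month indices, stepping forward by `test`."""
    start = 0
    while start + train + test <= n_months:
        yield (range(start, start + train),
               range(start + train, start + train + test))
        start += test


# Usage: a universe of U = 5 stocks on one date, with illustrative numbers only.
scores = [0.2, -0.1, 0.4, 0.0, 0.3]
returns = [0.01, -0.02, 0.03, 0.00, 0.02]
loss = ranking_loss(scores, returns)
windows = list(walk_forward_windows(50))
```

Under these assumptions, the loss is averaged over test dates within each window, and each model is refit (or re-prompted) on the preceding 48 months before scoring the next month.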

Taxonomy

Topics: Stock Market Forecasting Methods · Explainable Artificial Intelligence (XAI) · Financial Distress and Bankruptcy Prediction