RETuning: Upgrading Inference-Time Scaling for Stock Movement Prediction with Large Language Models
Xueyuan Lin, Cehao Yang, Ye Ma, Ming Li, Rongjunchen Zhang, Yang Ni, Xiaojun Wu, Chengjin Xu, Jian Guo, Hui Xiong

TL;DR
This paper introduces RETuning, a method to improve large language models' reasoning for stock movement prediction by constructing and scoring evidence from diverse sources, leading to more reliable and independent analytical predictions.
Contribution
The paper proposes Reflective Evidence Tuning (RETuning), a novel approach that enhances LLMs' reasoning in financial tasks by dynamically organizing evidence before prediction.
Findings
RETuning improves prediction accuracy on stock movement tasks.
Models maintain reasoning ability over time and on out-of-distribution stocks.
A large-scale dataset enables comprehensive evaluation of models.
Abstract
Recently, large language models (LLMs) have demonstrated outstanding reasoning capabilities on mathematical and coding tasks. However, their application to financial tasks, especially the most fundamental task of stock movement prediction, remains underexplored. We study a three-class classification problem (up, hold, down) and, by analyzing existing reasoning responses, observe that: (1) LLMs follow analysts' opinions rather than exhibit a systematic, independent analytical logic (CoTs). (2) LLMs list summaries from different sources without weighing adversarial evidence, yet such counterevidence is crucial for reliable prediction. This shows that the model does not make good use of its reasoning ability to complete the task. To address this, we propose Reflective Evidence Tuning (RETuning), a cold-start method prior to reinforcement learning, to enhance prediction ability. While…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. This paper develops a multimodal dataset including numerical and text data of stock markets for enhancing the stock price classification with reasoning LLMs. 2. The experiment is conducted on multiple LLMs and financial tasks in addition to stock price classification.
1. Arbitrary labeling threshold: The target setup appears arbitrary, as labels are defined using a fixed 3% threshold. Given that stock markets are highly dynamic with varying spreads, a more convincing rationale is needed to justify this static labeling choice. 2. Clarity of results in Figure 3: Figure 3 is unclear. It is not specified which results correspond to experiments conducted without using chain-of-thought (CoT). 3. Language coverage in the dataset: In the appendix, most of the
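The fixed-threshold labeling the reviewer questions can be sketched as follows. This is a minimal illustration, not the paper's code: the symmetric cut-offs at +3% and -3%, the function names `label_movement` and `label_counts`, and the sample return series are all assumptions made for the example.

```python
from collections import Counter


def label_movement(ret: float, threshold: float = 0.03) -> str:
    """Map a fractional return to up/hold/down with a fixed threshold.

    Assumption: symmetric cut-offs at +threshold and -threshold; the excerpt
    only states that a fixed 3% threshold is used.
    """
    if ret > threshold:
        return "up"
    if ret < -threshold:
        return "down"
    return "hold"


def label_counts(returns, threshold: float = 0.03) -> Counter:
    """Count how many returns fall into each class."""
    return Counter(label_movement(r, threshold) for r in returns)


# Hypothetical return series for a quiet stock and a volatile stock.
low_vol = [0.004, -0.006, 0.011, -0.002]
high_vol = [0.052, -0.047, 0.018, -0.061]

print(label_counts(low_vol))
print(label_counts(high_vol))
```

With these made-up inputs, every return of the quiet stock is labeled hold, while the volatile stock spreads across all three classes. That is the concern in miniature: one static threshold yields very different class balances depending on how much a stock moves.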
The following stands out as strengths of the paper: 1. Well-motivated inference-time optimization: RETuning effectively links SFT and RL (via GRPO) for structured reasoning enhancement without architectural changes. 2. Experiments: Fair amount of experiments across multiple financial datasets and OOD splits with transparent hyperparameters. 3. Strong performance: Up to +20.7% F1 improvement on Fin-2024 benchmarks (**self-proposed, contributed** dataset), outperforming strong LLM baselines. 4
1. **Novelty**: 2-stage inference-time mechanisms, under various monikers (teacher-student; System 1,2; ...), are copious in AI/ML/Robotics literature. They may be relatively uncommon in domain-specific settings (financial markets), but the general idea has been explored in this specific task (SMP) earlier (e.g., [1](https://openreview.net/forum?id=y3W1TVuJii)). 2. Domain limitation, Evaluation Scope a
1) The paper is well written and presents its methodology in a clear and logically coherent manner. 2) RETuning introduces a reasoning-driven paradigm that goes beyond pattern matching by encouraging models to construct, score, and reflect on evidence before prediction. Also, the framework’s ability to generalize to other financial tasks in BizFinBench (Table 2) further illustrates its adaptability and transferability beyond stock prediction. 3) The paper provides carefully designed empirical
1) Although the paper claims significant gains in predictive accuracy (mainly via F1 score), it omits financially meaningful metrics such as cumulative return, Sharpe ratio, and maximum drawdown. Without these, it is hard to tell whether RETuning’s predictions translate into better risk-adjusted profitability or trading performance. This is a critical omission for a paper targeting real-world stock-trading applications, where financial utility and stability matter more
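For reference, the financial metrics named in this review can be computed from a series of daily strategy returns roughly as below. This is a sketch under stated assumptions: the excerpt does not say how predictions would be turned into positions, so the input is simply a hypothetical list of daily returns, and the 252-day annualization and zero risk-free rate are conventional defaults rather than choices taken from the paper.

```python
import math
import statistics


def cumulative_return(daily_returns):
    """Compound daily fractional returns into a total return."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0


def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio; needs at least two observations."""
    excess = [r - risk_free_daily for r in daily_returns]
    std = statistics.stdev(excess)  # sample standard deviation
    if std == 0:
        return float("nan")  # undefined when returns never vary
    return statistics.mean(excess) / std * math.sqrt(periods_per_year)


def max_drawdown(daily_returns):
    """Largest peak-to-trough loss of the equity curve (zero or negative)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


# Hypothetical strategy returns, for illustration only.
strategy = [0.10, -0.50, 0.20]
print(cumulative_return(strategy), max_drawdown(strategy))
```

A classifier can score well on F1 and still do poorly on these measures, for example when its few errors coincide with the largest price moves.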
Overall, the paper is well-organized and readable. The effort beyond prompt engineering is substantial, including SFT, RL, and dataset construction. The problem is also interesting.
- The target is the overnight gap return instead of next-day close-to-close movement. This setting limits economic interpretability and predictive scope. The justification that this choice reduces memorization is unconvincing. What LLMs truly memorize is mostly the news, company entities, etc., so the LLMs are likely to know which companies are the historical winners; that is where survivorship bias and look-ahead bias occur. It is less likely (though not impossible) that LLMs are trained
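The two targets contrasted in this review can be written out as below. The definitions are the usual conventions (previous close to today's open, and previous close to today's close) and are an assumption here, since the excerpt does not show the paper's exact construction. The function names and price bars are hypothetical.

```python
def overnight_gap_return(prev_close: float, today_open: float) -> float:
    """Return from the previous session's close to today's open."""
    return today_open / prev_close - 1.0


def close_to_close_return(prev_close: float, today_close: float) -> float:
    """Return from the previous session's close to today's close."""
    return today_close / prev_close - 1.0


# Two hypothetical daily bars: the stock gaps up at the open, then sells off.
bars = [
    {"open": 100.0, "close": 102.0},
    {"open": 105.0, "close": 101.0},
]

prev_close = bars[0]["close"]
gap = overnight_gap_return(prev_close, bars[1]["open"])
c2c = close_to_close_return(prev_close, bars[1]["close"])
print(f"overnight gap: {gap:.4f}, close-to-close: {c2c:.4f}")
```

In this made-up case the overnight gap is positive while the close-to-close return is negative, so the two targets can assign opposite labels to the same day. This is one reason the choice of target affects economic interpretability.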
