Reinforced Latent Reasoning for LLM-based Recommendation
Yang Zhang, Wenxin Xu, Xiaoyan Zhao, Wenjie Wang, Fuli Feng, Xiangnan He, Tat-Seng Chua

TL;DR
This paper introduces LatentR3, a reinforcement learning framework that enables large language models to perform efficient, implicit reasoning for recommendation tasks without explicit chain-of-thought data, improving performance and inference speed.
Contribution
It proposes a novel end-to-end RL-based training method for latent reasoning in LLMs, eliminating the need for explicit reasoning data and enhancing recommendation accuracy.
Findings
LatentR3 improves recommendation performance across various LLMs.
The framework reduces inference latency by avoiding explicit reasoning generation.
Reinforcement learning effectively trains latent reasoning modules without supervision.
Abstract
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks, sparking growing interest in their application to preference reasoning in recommendation systems. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. However, these methods face significant practical limitations due to (1) the difficulty of obtaining high-quality CoT data in recommendation and (2) the high inference latency caused by generating CoT reasoning. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning. This approach eliminates the need for explicit CoT generation and improves inference efficiency, as few latent tokens can effectively capture the entire reasoning process. Building on this idea, we propose Reinforced…
Peer Reviews
Decision · ICLR 2026 Poster
1. Timely exploration of latent reasoning for recommendation.
2. The LatentRATT layer appears effective for generating latent vectors that improve performance.
3. The advantage computation using batch-level averaging is a clear and reasonable change to GRPO.
4. Extensive experiments on four public datasets.
5. Code is available during the review period.
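The batch-level averaging in point 3 is easiest to see next to the group-normalized advantage of standard GRPO. The review does not give the paper's exact formula, so the following is only a minimal sketch under one assumption: that "batch-level averaging" means subtracting the mean reward over every sample in the batch, rather than the mean of each prompt's own group. The function names and the toy rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards_by_prompt, eps=1e-8):
    """Standard GRPO-style advantage: normalize each reward
    against the mean and std of its own prompt's group."""
    advantages = []
    for group in rewards_by_prompt:
        mu = mean(group)
        sigma = pstdev(group)
        advantages.append([(r - mu) / (sigma + eps) for r in group])
    return advantages


def batch_advantages(rewards_by_prompt):
    """Assumed batch-level variant (not confirmed by the source):
    use one shared baseline, the mean reward over the whole batch."""
    flat = [r for group in rewards_by_prompt for r in group]
    baseline = mean(flat)
    return [[r - baseline for r in group] for group in rewards_by_prompt]


# Toy batch: two prompts, two sampled rewards each (made-up numbers).
rewards = [[1.0, 3.0], [5.0, 7.0]]
per_group = group_advantages(rewards)
per_batch = batch_advantages(rewards)
```

In this toy batch, the group-level version assigns both prompts the same pair of advantages, because each group is centered on its own mean. The batch-level version keeps the second prompt's samples above the first's, so differences between prompts still contribute to the learning signal.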
1. Why are LLMs necessary here?
   * The authors state that prior latent-reasoning-for-recommendation works are not LLM-based and that their approach is tailored to LLMs.
   * However, the core techniques, namely LatentRATT and the modified GRPO algorithm, are not inherently tied to language modeling. In principle they could be applied to conventional recommenders such as SASRec. The "reasoning" is encoded in a latent vector of length 1, not in explicit linguistic reasoning.
   * To support t
The paper makes a wise and well-justified design choice. For the sequential recommendation problem, latent reasoning is indeed a more suitable approach. Rather than performing explicit chain-of-thought reasoning, which assumes some human-like logical process, the task here is essentially about learning an approximator that fits the pattern of the next purchased (or interacted) item based on historical data. In fact, most sequential recommendation problems do not lend themselves to explicit CoT
1. In the experiments, the authors use Qwen2.5-1.5B as the base LLM and keep it frozen during training, only updating the LatentRATT module. This raises several concerns. First, given that Qwen2.5-1.5B is a relatively small model and easy to fine-tune, it would be reasonable to jointly train the entire model rather than freezing the backbone. Such joint optimization might lead to better results. Moreover, an additional baseline should be included, a fully fine-tuned Qwen2.5-1.5B model trained wi
1. The framework successfully addresses a major practical limitation of LLM-based recommenders by eliminating the generation of verbose explicit CoT text. By compressing the reasoning into a few latent tokens, it achieves high performance while maintaining efficiency, which is crucial for real-time deployment. 2. $\text{LatentR}^3$ utilizes a reinforcement learning approach to optimize the latent reasoning process, which allows the model to learn effective reasoning strategies directly from the
1. The experimental validation is restricted in scope. Firstly, it only uses a specific family of datasets (Amazon review data), lacking tests on other popular and structurally different public benchmarks like MovieLens-1M. Secondly, the framework is only implemented and tested on relatively small-scale LLM backbones (e.g., $D^3$ or BIGRec, likely based on BERT or similar models), leaving its scalability and continued effectiveness on large, cutting-edge foundation models (e.g., Llama-7B/13B) un
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
Methods: Sparse Evolutionary Training
