Reinforced Latent Reasoning for LLM-based Recommendation

Yang Zhang; Wenxin Xu; Xiaoyan Zhao; Wenjie Wang; Fuli Feng; Xiangnan He; Tat-Seng Chua

arXiv:2505.19092·cs.AI·October 27, 2025

TL;DR

This paper introduces LatentR3, a reinforcement learning framework that enables large language models to perform efficient, implicit reasoning for recommendation tasks without explicit chain-of-thought data, improving performance and inference speed.

Contribution

It proposes a novel end-to-end RL-based training method for latent reasoning in LLMs, eliminating the need for explicit reasoning data and enhancing recommendation accuracy.

Findings

01

LatentR3 improves recommendation performance across various LLMs.

02

The framework reduces inference latency by avoiding explicit reasoning generation.

03

Reinforcement learning effectively trains latent reasoning modules without supervision.

Abstract

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks, sparking growing interest in their application to preference reasoning in recommendation systems. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. However, these methods face significant practical limitations due to (1) the difficulty of obtaining high-quality CoT data in recommendation and (2) the high inference latency caused by generating CoT reasoning. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning. This approach eliminates the need for explicit CoT generation and improves inference efficiency, as few latent tokens can effectively capture the entire reasoning process. Building on this idea, we propose \textit{\underline{R}einforced…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 4

Strengths

1. Timely exploration of latent reasoning for recommendation.
2. The LatentRATT layer appears effective for generating latent vectors that improve performance.
3. The advantage computation using batch-level averaging is a clear and reasonable change to GRPO.
4. Extensive experiments on four public datasets.
5. Code is available during the review period.
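To illustrate the batch-level advantage mentioned in strength 3, here is a minimal sketch contrasting it with the usual group-level GRPO normalization. It is one plausible reading of the reviewer's description, not the authors' implementation: the function names, the `eps` smoothing term, and the choice to normalize by the standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards_per_prompt, eps=1e-8):
    """GRPO-style baseline: normalize rewards within each prompt's sample group.

    rewards_per_prompt is a list of lists, one inner list per prompt.
    """
    advantages = []
    for group in rewards_per_prompt:
        mu = mean(group)
        sigma = pstdev(group)
        advantages.append([(r - mu) / (sigma + eps) for r in group])
    return advantages


def batch_advantages(rewards, eps=1e-8):
    """Assumed batch-level variant: one reward per example, normalized
    against the mean and standard deviation of the whole batch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards, chosen only to show the shape of the computation.
    print(group_advantages([[1.0, 3.0], [5.0, 5.0]]))
    print(batch_advantages([1.0, 2.0, 3.0]))
```

Under this reading, the batch-level form needs only one sampled output per training example, since the baseline comes from the other examples in the batch rather than from repeated samples of the same prompt.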

Weaknesses

1. Why are LLMs necessary here?
   * The authors state that prior latent-reasoning-for-recommendation works are not LLM-based and that their approach is tailored to LLMs.
   * However, the core techniques, namely LatentRATT and the modified GRPO algorithm, are not inherently tied to language modeling. In principle they could be applied to conventional recommenders such as SASRec. The "reasoning" is encoded in a latent vector of length 1, not in explicit linguistic reasoning.
   * To support t

Reviewer 02·Rating 4·Confidence 5

Strengths

The paper makes a wise and well-justified design choice. For the sequential recommendation problem, latent reasoning is indeed a more suitable approach. Rather than performing explicit chain-of-thought reasoning, which assumes some human-like logical process, the task here is essentially about learning an approximator that fits the pattern of the next purchased (or interacted) item based on historical data. In fact, most sequential recommendation problems do not lend themselves to explicit CoT

Weaknesses

1. In the experiments, the authors use Qwen2.5-1.5B as the base LLM and keep it frozen during training, only updating the LatentRATT module. This raises several concerns. First, given that Qwen2.5-1.5B is a relatively small model and easy to fine-tune, it would be reasonable to jointly train the entire model rather than freezing the backbone. Such joint optimization might lead to better results. Moreover, an additional baseline should be included, a fully fine-tuned Qwen2.5-1.5B model trained wi

Reviewer 03·Rating 6·Confidence 4

Strengths

1. The framework successfully addresses a major practical limitation of LLM-based recommenders by eliminating the generation of verbose explicit CoT text. By compressing the reasoning into a few latent tokens, it achieves high performance while maintaining efficiency, which is crucial for real-time deployment.
2. $\text{LatentR}^3$ utilizes a reinforcement learning approach to optimize the latent reasoning process, which allows the model to learn effective reasoning strategies directly from the

Weaknesses

1. The experimental validation is restricted in scope. Firstly, it only uses a specific family of datasets (Amazon review data), lacking tests on other popular and structurally different public benchmarks like MovieLens-1M. Secondly, the framework is only implemented and tested on relatively small-scale LLM backbones (e.g., $D^3$ or BIGRec, likely based on BERT or similar models), leaving its scalability and continued effectiveness on large, cutting-edge foundation models (e.g., Llama-7B/13B) un

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies

Methods: Sparse Evolutionary Training