Hybrid Latent Reasoning via Reinforcement Learning
Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang

TL;DR
This paper proposes a reinforcement learning-based hybrid latent reasoning method for large language models that integrates continuous and discrete reasoning without relying on chain-of-thought traces, leading to improved performance and interpretability.
Contribution
The paper introduces HRPO, an RL-based approach that combines hidden states and token embeddings for latent reasoning in LLMs, addressing the conflict between continuous latent reasoning and discrete autoregressive generation.
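As a rough illustration of what "combining hidden states and token embeddings" could look like, the sketch below forms a convex blend of the two vectors. This summary does not specify how HRPO performs the combination, so the function name `blend_inputs`, the scalar `gate`, and the assumption that both vectors share one dimension are all hypothetical and not taken from the paper.

```python
from typing import List


def blend_inputs(token_emb: List[float], hidden_state: List[float], gate: float) -> List[float]:
    """Convex blend of a discrete token embedding and a continuous hidden state.

    Hypothetical illustration only: gate=1.0 keeps the token embedding alone,
    gate=0.0 keeps the hidden state alone, values in between mix the two.
    """
    if len(token_emb) != len(hidden_state):
        raise ValueError("token_emb and hidden_state must have the same length")
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    # Element-wise interpolation between the two representations.
    return [gate * t + (1.0 - gate) * h for t, h in zip(token_emb, hidden_state)]


if __name__ == "__main__":
    token = [1.0, 0.0]
    hidden = [0.0, 1.0]
    print(blend_inputs(token, hidden, 0.5))
```

With `gate` near 1 the next-step input is close to ordinary token-by-token generation, and lowering it lets more of the continuous representation through. Whether and how such a weight is set or learned is beyond what this summary states.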
Findings
HRPO outperforms prior methods on diverse benchmarks.
Models trained with HRPO exhibit interpretability and cross-lingual patterns.
HRPO maintains generative capabilities while enhancing reasoning performance.
Abstract
Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Reinforcement Learning in Robotics
