Beyond Imitation: Reinforcement Learning for Active Latent Planning
Zhi Zheng, Wee Sun Lee

TL;DR
This paper introduces ATP-Latent, an active latent planning method that combines a variational autoencoder (VAE) with reinforcement learning to optimize latent token representations, improving the reasoning accuracy and token efficiency of large language models.
Contribution
It proposes an active latent planning framework that pairs VAE-based supervision with an RL-guided policy to improve reasoning in LLMs, addressing the limitations of passively imitating language labels.
Findings
+4.1% accuracy on benchmarks
-3.3% tokens used compared to baselines
Improved latent reasoning policy effectiveness
Abstract
Aiming at efficient and dense chain-of-thought (CoT) reasoning, latent reasoning methods fine-tune Large Language Models (LLMs) to substitute discrete language tokens with continuous latent tokens. These methods consume fewer tokens compared to the conventional language CoT reasoning and have the potential to plan in a dense latent space. However, current latent tokens are generally supervised based on imitating language labels. Considering that there can be multiple equivalent but diverse CoT labels for a question, passively imitating an arbitrary one may lead to inferior latent token representations and latent reasoning policies, undermining the potential planning ability and resulting in clear gaps between training and testing. In this work, we emphasize the importance of active planning over the representation space of latent tokens in achieving the optimal latent reasoning policy.…
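To make the distinction in the abstract concrete, the toy sketch below contrasts a discrete (language) rollout, where each hidden state is collapsed to a single token and re-embedded, with a latent rollout, where the continuous hidden state is fed straight back as the next input. It is an illustration of the general idea of latent tokens only, not the ATP-Latent method: the one-layer `step` function, the random untrained weights, and all names (`embed`, `w_step`, `discrete_rollout`, `latent_rollout`) are assumptions made for this example, and no VAE or reinforcement learning is involved.

```python
import math
import random

random.seed(0)
VOCAB, DIM = 5, 4

# Hypothetical toy parameters: random and untrained, for illustration only.
embed = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
w_step = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]


def step(x):
    """Toy stand-in for one model forward pass: hidden = tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_step]


def logits(h):
    """Score each vocabulary item by its embedding's dot product with h."""
    return [sum(e * hi for e, hi in zip(row, h)) for row in embed]


def discrete_rollout(x, n_steps):
    """Language CoT: pick one token per step, then re-embed that token."""
    tokens = []
    for _ in range(n_steps):
        scores = logits(step(x))
        tok = max(range(VOCAB), key=scores.__getitem__)
        tokens.append(tok)
        x = embed[tok]  # everything except the chosen token id is discarded
    return tokens


def latent_rollout(x, n_steps):
    """Latent CoT: feed the continuous hidden state back as the next input."""
    latents = []
    for _ in range(n_steps):
        x = step(x)  # no projection onto the vocabulary
        latents.append(x)
    return latents


if __name__ == "__main__":
    print("discrete tokens:", discrete_rollout(embed[0], 3))
    print("latent vectors: ", latent_rollout(embed[0], 3))
```

The discrete rollout returns a list of token ids, while the latent rollout returns a list of real-valued vectors. The second form is what lets a single step carry more information than one vocabulary item, which is the density argument made in the abstract.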
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Machine Learning in Healthcare
