Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains

Wenhui Tan; Jiaze Li; Jianzhong Ju; Zhenbo Luo; Ruihua Song; Jian Luan

arXiv:2505.16552 · cs.CL · February 4, 2026

TL;DR

This paper introduces CoLaR (Compressed Latent Reasoning), a framework that compresses LLM reasoning chains in latent space so that reasoning takes fewer steps. It reports cutting reasoning chain length by 53.3% at a cost of 4.8% in performance.

Contribution

CoLaR dynamically compresses reasoning in latent space and is trained in two stages: supervised fine-tuning with an auxiliary next-compressed-embedding prediction objective, followed by reinforcement learning that explores diverse reasoning paths through a non-deterministic latent head.

Findings

01. Achieves 14.1% higher accuracy than the baseline at similar compression levels.

02. Reduces reasoning chain length by 53.3% with only a 4.8% performance loss.

03. Improves performance by up to 5.4% on challenging tasks while reducing reasoning length by 82.8%.

Abstract

Large Language Models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends beyond next-token prediction by incorporating an auxiliary next compressed embedding prediction objective. This process merges embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict distributions of subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that leverages the latent head's non-deterministic nature to explore diverse reasoning…
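The compression step described in the abstract (merging the embeddings of consecutive tokens with a randomly sampled compression factor) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `compress_embeddings` and `sample_factor` are hypothetical, and averaging is assumed here as the merge operation purely for illustration, since the abstract does not specify how embeddings are merged. Plain Python lists stand in for embedding tensors.

```python
import random
from typing import List

Vector = List[float]


def sample_factor(low: int, high: int, rng: random.Random) -> int:
    """Sample a compression factor uniformly from [low, high] (inclusive).

    Assumption: the 'predefined range' is a range of integers.
    """
    return rng.randint(low, high)


def compress_embeddings(embeddings: List[Vector], factor: int) -> List[Vector]:
    """Merge each run of `factor` consecutive embeddings into one vector.

    Assumption: the merge is an element-wise mean. A trailing group
    shorter than `factor` is averaged over the vectors it contains.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    compressed: List[Vector] = []
    for start in range(0, len(embeddings), factor):
        group = embeddings[start:start + factor]
        dim = len(group[0])
        merged = [sum(vec[i] for vec in group) / len(group) for i in range(dim)]
        compressed.append(merged)
    return compressed


if __name__ == "__main__":
    rng = random.Random(0)
    # Six toy token embeddings of dimension 2.
    tokens = [[float(i), float(i) * 2.0] for i in range(6)]
    factor = sample_factor(2, 3, rng)
    latents = compress_embeddings(tokens, factor)
    print(f"factor={factor}: {len(tokens)} tokens -> {len(latents)} latents")
```

With a factor of c, a chain of n token embeddings shrinks to roughly n / c compressed embeddings, which is where the reduction in reasoning length comes from. Sampling the factor anew during training exposes the model to several compression rates rather than a single fixed one.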

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings