CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, Yulan He

TL;DR
CODI introduces a self-distillation framework that compresses explicit chain-of-thought reasoning into a continuous latent space, achieving performance comparable to explicit CoT with improved efficiency and robustness in large language models.
Contribution
CODI is the first implicit CoT method to match explicit CoT performance. It does so by training a student task to reason in continuous space through self-distillation from a jointly trained explicit CoT teacher task.
Findings
Achieves a 3.1x compression rate on GSM8k.
Outperforms the previous state-of-the-art implicit CoT method by 28.2% in accuracy.
Demonstrates robustness and interpretability in reasoning.
Abstract
Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by encouraging step-by-step reasoning in natural language. However, leveraging a latent continuous space for reasoning may offer benefits in terms of both efficiency and robustness. Prior implicit CoT methods attempt to bypass language completely by reasoning in continuous space but have consistently underperformed compared to the standard explicit CoT approach. We introduce CODI (Continuous Chain-of-Thought via Self-Distillation), a novel training framework that effectively compresses natural language CoT into continuous space. CODI jointly trains a teacher task (Explicit CoT) and a student task (Implicit CoT), distilling the reasoning ability from language into continuous space by aligning the hidden states of a designated token. Our experiments show that CODI is the first implicit CoT approach to match the…
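The joint objective described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract only says that the hidden states of a designated token are aligned, so the L1 distance, the per-layer averaging, the loss weights (`alpha`, `beta`, `gamma`) and all function names here are hypothetical choices made for the example. The teacher and student cross-entropy values are taken as given numbers rather than computed from a model.

```python
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("hidden-state vectors must have the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def hidden_state_alignment_loss(
    teacher_layers: Sequence[Sequence[float]],
    student_layers: Sequence[Sequence[float]],
) -> float:
    """Average per-layer distance at the designated token.

    Each argument holds one hidden-state vector per layer, taken at the
    designated token position. Assumption: the teacher states act as a
    fixed target (in an autodiff framework one would stop gradients
    through them); the L1 distance is also an assumption.
    """
    if len(teacher_layers) != len(student_layers):
        raise ValueError("teacher and student must expose the same layers")
    per_layer = [l1_distance(t, s) for t, s in zip(teacher_layers, student_layers)]
    return sum(per_layer) / len(per_layer)


def joint_loss(
    teacher_ce: float,
    student_ce: float,
    teacher_layers: Sequence[Sequence[float]],
    student_layers: Sequence[Sequence[float]],
    alpha: float = 1.0,  # hypothetical weight on the explicit CoT (teacher) task
    beta: float = 1.0,   # hypothetical weight on the implicit CoT (student) task
    gamma: float = 1.0,  # hypothetical weight on the alignment term
) -> float:
    """Weighted sum of the teacher loss, the student loss, and the alignment term."""
    align = hidden_state_alignment_loss(teacher_layers, student_layers)
    return alpha * teacher_ce + beta * student_ce + gamma * align


if __name__ == "__main__":
    # Toy two-layer, two-dimensional hidden states at the designated token.
    teacher = [[1.0, 2.0], [0.0, 0.0]]
    student = [[1.5, 2.0], [0.0, 1.0]]
    print(joint_loss(1.0, 2.0, teacher, student))  # 3.375
```

In the toy call, the alignment term is 0.375 (0.25 for the first layer, 0.5 for the second), which is added to the two task losses. Because teacher and student are trained jointly, a real implementation would compute both cross-entropy terms from the same model in one training step.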
Taxonomy
Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Softmax · Attention Dropout · Linear Layer · Residual Connection · Byte Pair Encoding · Weight Decay
