Inference-Time Chain-of-Thought Pruning with Latent Informativeness Signals

Sophie Li; Nicholas Huang; Nayan Saxena; Nina Luo; Vincent Lin; Kevin Zhu; Sunishchal Dev

arXiv:2511.00699·cs.LG·November 5, 2025

TL;DR

This paper introduces KAPPA, a new inference-time pruning method for large language models that uses a principled scoring function to reduce computational costs while maintaining reasoning accuracy.

Contribution

KAPPA combines KL divergence, confidence, and entropy into a scoring function for effective branch pruning during inference, improving efficiency over existing heuristics.
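
As a rough illustration of how three such signals could be folded into one number, the sketch below scores a branch from its next-token distribution. It is not the paper's formula: the function names, the weights `w_kl`, `w_conf`, `w_ent`, the signs, and the choice of `reference` distribution for the KL term are all assumptions made for this example.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

def branch_score(p, reference, w_kl=1.0, w_conf=1.0, w_ent=1.0):
    """Hypothetical branch score: higher means more worth keeping.

    Assumed form (not taken from the paper): reward divergence from a
    reference distribution and a high top-token probability, and
    penalize entropy.
    """
    kl = kl_divergence(p, reference)
    confidence = max(p)
    return w_kl * kl + w_conf * confidence - w_ent * entropy(p)

# Toy comparison: a peaked distribution versus a uniform one.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(branch_score(peaked, uniform) > branch_score(uniform, uniform))  # True
```

Under these assumed signs, a branch whose distribution is both sharp and far from the reference scores higher than an uninformative, uniform one.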

Findings

1. KAPPA stabilizes performance in smaller models.

2. KAPPA achieves up to 60% memory reduction.

3. KAPPA reduces total token generation by 90%.

Abstract

Large language models (LLMs) improve reasoning accuracy when generating multiple candidate solutions at test time, but standard methods like Best-of-N (BoN) incur high computational cost by fully generating all branches. Self-Truncation Best-of-N (ST-BoN) mitigates this by truncating unpromising paths early, but its reliance on consistency-based heuristics is a limitation as it does not directly evaluate branch quality. We present KL-Adjusted Pruned Path Algorithm (KAPPA), an inference-time method that combines Kullback-Leibler divergence, confidence, and entropy into a principled scoring function to guide progressive pruning. By promoting diversity during exploration and selectively eliminating low-scoring branches, KAPPA maintains accuracy while substantially reducing memory and token usage. Experiments on GSM8K and MATH500 with DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-7B-Instruct…
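
To make the idea of progressive pruning concrete, here is a minimal sketch that assumes per-step branch scores are already available. The schedule (`keep_fraction`, `prune_every`), the use of cumulative scores, and the one-token-per-step cost model are hypothetical choices for illustration, not details taken from the paper.

```python
def progressive_prune(scores_per_step, keep_fraction=0.5, prune_every=2):
    """Toy progressive pruning over precomputed scores.

    scores_per_step[t][b] is the (hypothetical) score of branch b at step t.
    Returns (index of the surviving best branch, tokens generated),
    counting one token per live branch per step.
    """
    alive = list(range(len(scores_per_step[0])))
    cumulative = {b: 0.0 for b in alive}
    tokens = 0
    for t, step_scores in enumerate(scores_per_step, start=1):
        tokens += len(alive)  # every live branch emits one token
        for b in alive:
            cumulative[b] += step_scores[b]
        # Periodically drop the lowest-scoring branches.
        if t % prune_every == 0 and len(alive) > 1:
            keep = max(1, int(len(alive) * keep_fraction))
            alive = sorted(alive, key=lambda b: cumulative[b], reverse=True)[:keep]
    best = max(alive, key=lambda b: cumulative[b])
    return best, tokens

# Four branches, six steps, constant made-up scores per branch.
scores = [[0.1, 0.9, 0.5, 0.3]] * 6
best, tokens = progressive_prune(scores)
print(best, tokens)  # 1 14 (versus 4 * 6 = 24 tokens with no pruning)
```

In this toy setting the saving comes entirely from dropping branches before they finish, which is the same mechanism the abstract credits for reduced memory and token usage; the actual savings reported in the paper depend on its own scoring and schedule.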

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software System Performance and Reliability