Scalable LLM Reasoning Acceleration with Low-rank Distillation
Harry Dong, Bilge Acun, Beidi Chen, Yuejie Chi

TL;DR
Caprese is a low-rank distillation method that recovers the reasoning capabilities large language models lose when inference acceleration techniques are applied, while cutting computation and latency and leaving language task performance unharmed.
Contribution
Introduces Caprese, a low-rank distillation approach that restores reasoning abilities in efficient LLMs using roughly 1% additional parameters and only 20K synthetic training samples, with the original weights left unperturbed.
Findings
Recovers much, if not all, of the lost reasoning capability with roughly 1% additional parameters.
Reduces active parameters by approximately 2 billion for Gemma 2 9B and Llama 3.1 8B.
Achieves over 16% reduction in inference latency.
Abstract
Due to long generations, large language model (LLM) math reasoning demands significant computational resources and time. While many existing efficient inference methods have been developed with excellent performance preservation on language tasks, they often severely degrade math performance. In this paper, we propose Caprese, a resource-efficient distillation method to recover capabilities lost when deploying efficient inference methods, focused primarily on feedforward blocks. With original weights unperturbed, roughly 1% of additional parameters, and only 20K synthetic training samples, we are able to recover much, if not all, of the reasoning capabilities lost from efficient inference for thinking LLMs, without harm to language tasks for instruct LLMs. Moreover, Caprese slashes the number of active parameters (~2B cut for Gemma 2 9B and Llama 3.1 8B) and integrates cleanly into…
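The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea of low-rank distillation on a feedforward block, not the authors' implementation. Everything in it is an assumption made for illustration: the toy sizes, the top-k activation pruning used as a stand-in for "an efficient inference method", the names `dense_ff` and `efficient_ff`, and the closed-form reduced-rank regression that replaces gradient-based training. The original weights stay frozen, and a small low-rank branch is fit to the gap between the dense output and the cheaper approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, r, n = 32, 128, 4, 2048  # hypothetical sizes: model dim, hidden dim, rank, samples

# Frozen "original" feedforward weights; they are never modified below.
W_in = rng.normal(size=(d, h)) / np.sqrt(d)
W_out = rng.normal(size=(h, d)) / np.sqrt(h)

def dense_ff(x):
    """Full feedforward block (ReLU used here purely for simplicity)."""
    return np.maximum(x @ W_in, 0.0) @ W_out

def efficient_ff(x, keep=32):
    """Stand-in efficient method (assumption): keep the `keep` largest hidden activations per token."""
    a = np.maximum(x @ W_in, 0.0)
    thresh = np.partition(a, -keep, axis=1)[:, -keep][:, None]
    return np.where(a >= thresh, a, 0.0) @ W_out

# Stand-in for synthetic training inputs.
X = rng.normal(size=(n, d))
residual = dense_ff(X) - efficient_ff(X)  # what the efficient path loses

# Fit a rank-r linear branch x -> x @ A @ B to the residual.
# Closed-form reduced-rank regression is used here instead of gradient training.
M = np.linalg.lstsq(X, residual, rcond=None)[0]          # full-rank least-squares map (d x d)
_, _, Vt = np.linalg.svd(X @ M, full_matrices=False)
V_r = Vt[:r].T                                           # top-r output directions (d x r)
A, B = M @ V_r, V_r.T                                    # low-rank factors: (d x r), (r x d)

def corrected_ff(x):
    """Efficient path plus the low-rank correction branch."""
    return efficient_ff(x) + (x @ A) @ B

# Errors are measured on the same samples used for fitting.
err_before = np.mean((dense_ff(X) - efficient_ff(X)) ** 2)
err_after = np.mean((dense_ff(X) - corrected_ff(X)) ** 2)
added_fraction = (A.size + B.size) / (W_in.size + W_out.size)

print(f"MSE vs dense, efficient only:  {err_before:.5f}")
print(f"MSE vs dense, with low-rank:   {err_after:.5f}")
print(f"extra parameters vs FF block:  {added_fraction:.2%}")
```

In this toy setting the branch adds `2 * d * r` parameters per block. How much of the gap a linear low-rank branch closes depends entirely on the chosen approximation and rank, so the printed numbers say nothing about the results reported for Caprese.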
Taxonomy
Methods: LLaMA
