Distilling System 2 into System 1
Ping Yu, Jing Xu, Jason Weston, Ilia Kulikov

TL;DR
This paper explores methods to distill the outputs of System 2 reasoning techniques for large language models into direct System 1 responses, improving on the original System 1 performance at a lower inference cost than System 2.
Contribution
It investigates self-supervised methods that distill the higher-quality outputs of System 2 techniques back into LLM generations that omit the intermediate reasoning tokens.
Findings
Several System 2 techniques can be distilled successfully, improving results over the original System 1 performance.
The distilled models have lower inference cost than the System 2 techniques themselves.
The distillation is self-supervised.
Abstract
Large language models (LLMs) can spend extra compute during inference to generate intermediate thoughts, which helps to produce better final responses. Since Chain-of-Thought (Wei et al., 2022), many such System 2 techniques have been proposed, such as Rephrase and Respond (Deng et al., 2023a), System 2 Attention (Weston and Sukhbaatar, 2023) and Branch-Solve-Merge (Saha et al., 2023). In this work we investigate self-supervised methods to "compile" (distill) higher quality outputs from System 2 techniques back into LLM generations without intermediate reasoning token sequences, as this reasoning has been distilled into System 1. We show that several such techniques can be successfully distilled, resulting in improved results compared to the original System 1 performance, and with less inference cost than System 2. We posit that such System 2 distillation will be an important feature…
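The data-collection side of such a pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract says only that the methods are self-supervised, so the majority-vote agreement filter used here to pick targets without labels is an assumption. The names majority_answer, build_distillation_set, system2_answer, num_samples and min_agreement are all hypothetical. The sketch samples a System 2 pipeline several times per unlabeled prompt, keeps the final answer only when the samples agree often enough, and stores the (prompt, answer) pair with the intermediate reasoning dropped, ready for fine-tuning a System 1 model.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def majority_answer(answers: List[str], min_agreement: float) -> Optional[str]:
    """Return the most common answer if its share reaches min_agreement, else None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= min_agreement else None


def build_distillation_set(
    prompts: List[str],
    system2_answer: Callable[[str], str],  # hypothetical: runs a System 2 method, returns only its final answer
    num_samples: int = 8,
    min_agreement: float = 0.75,
) -> List[Tuple[str, str]]:
    """Collect (prompt, final answer) pairs; intermediate reasoning is never stored."""
    dataset = []
    for prompt in prompts:
        # Sample the System 2 pipeline several times on the same unlabeled prompt.
        samples = [system2_answer(prompt) for _ in range(num_samples)]
        # Assumed filter: keep the prompt only if the samples mostly agree.
        target = majority_answer(samples, min_agreement)
        if target is not None:
            dataset.append((prompt, target))
    return dataset
```

The resulting pairs would then be used as ordinary supervised fine-tuning data, so that the fine-tuned model answers directly without generating the reasoning tokens. The threshold and sample count are illustrative defaults, not values taken from the paper.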
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling
Methods: Softmax · Attention Is All You Need · Focus
