FutureMind: Equipping Small Language Models with Strategic Thinking-Pattern Priors via Adaptive Knowledge Distillation
Shaoxiong Yang, Junting Li, Mengyuan Zhang, Chao Li, Wei Liu, Jian Luan

TL;DR
FutureMind equips small language models with strategic reasoning through adaptive knowledge distillation from large models, improving their accuracy on complex, knowledge-intensive tasks.
Contribution
The paper introduces a modular reasoning framework that equips small language models with structured thinking-pattern priors via adaptive knowledge distillation from large models.
Findings
Outperforms strong baselines on multi-hop QA benchmarks
Achieves state-of-the-art results across diverse SLM architectures
Reveals a cognitive-bias bottleneck in reasoning skill transfer
Abstract
Small Language Models (SLMs) are attractive for cost-sensitive and resource-limited settings due to their efficient, low-latency inference. However, they often struggle with complex, knowledge-intensive tasks that require structured reasoning and effective retrieval. To address these limitations, we propose FutureMind, a modular reasoning framework that equips SLMs with strategic thinking-pattern priors via adaptive knowledge distillation from large language models (LLMs). FutureMind introduces a dynamic reasoning pipeline composed of four key modules: Problem Analysis, Logical Reasoning, Strategy Planning, and Retrieval Guidance. This pipeline is augmented by three distinct retrieval paradigms that decompose complex queries into tractable subproblems, ensuring efficient and accurate retrieval execution. Extensive experiments on multi-hop QA benchmarks, including 2WikiMultihopQA,…
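The four-module pipeline named in the abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `LLM` callable type, the `ReasoningState` container, the prompt wording, and the choice to run the modules in the order the abstract lists them are all hypothetical, and the three retrieval paradigms are not modeled. It is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for any text-in, text-out model call.
LLM = Callable[[str], str]

# Module names as listed in the abstract; running them in this order is an assumption.
MODULES = [
    "Problem Analysis",
    "Logical Reasoning",
    "Strategy Planning",
    "Retrieval Guidance",
]


@dataclass
class ReasoningState:
    question: str
    notes: List[str] = field(default_factory=list)  # one entry per completed module


def run_pipeline(question: str, teacher: LLM) -> ReasoningState:
    """Ask the teacher model for one note per module, feeding earlier notes forward."""
    state = ReasoningState(question)
    for name in MODULES:
        context = "\n".join(state.notes)
        prompt = f"[{name}]\nQuestion: {question}\nPrior notes:\n{context}"
        state.notes.append(f"{name}: {teacher(prompt)}")
    return state


def answer(state: ReasoningState, student: LLM) -> str:
    """Give the accumulated notes to the small model as a thinking-pattern prior."""
    prior = "\n".join(state.notes)
    return student(f"{prior}\nQuestion: {state.question}\nAnswer:")
```

Passing the models in as plain callables keeps the sketch training-free and independent of any particular inference library; a real system would replace the stubs with actual teacher and student calls.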
Peer Reviews
Decision: ICLR 2026 Poster
1. The most interesting aspect of this work is its redefinition of knowledge distillation. While traditional distillation often involves the student mimicking the teacher's trajectory, FutureMind empowers the student to master the teacher's structured thinking framework.
2. The experimental results reveal that an overly complex plan can become a barrier to the SLM's understanding and execution. This is a counter-intuitive and interesting phenomenon, for which the authors have provided a cor
1. I would like to ask the authors whether they have considered or attempted to use more direct metrics to quantify or predict the "cognitive compatibility" between teacher and student models, beyond observing final performance in experiments. For instance, could the distillation effectiveness be anticipated by analyzing the relationship between the complexity of the teacher's plan and the capability limits of the student model?
2. The experiments in this paper are primarily focused on structure
- The main idea of the paper is interesting and has clear merit. Leveraging larger LMs to guide smaller ones during multi-hop reasoning is a natural and well-motivated direction. The method is also training-free, which showcases its efficiency and its potential for deployment.
- The experimental results are promising. Across all model scales, integrating FutureMind outperforms the other baselines in almost all cases, which highlights its potential.
- The ablation study in Secti
- I think that the way each module is presented in Section 3, although detailed, is unnecessarily abstract and ends up confusing the reader instead of describing the modules clearly. Also, many parts are not adequately explained (such as what exactly the function $\mathcal{F}$ is), and there is a gap between the high-level overview of each module and the actual implementation (for instance, it is not clear how the authors prompt the larger LM at each step of the pipeline).
- A suggestion could be
1. The proposed method is intuitive and well-elaborated.
2. The proposed method demonstrates effectiveness and generalizability on four multi-hop QA datasets, using LLMs of different series and sizes.
1. This paper claims to be "a new state of the art among training-free methods", but it compares only against Naive Generation, Standard RAG, and Search-o1, ignoring various baselines from inference-time scaling methods and training-free LLM frameworks.
2. Section 4.1 Datasets: "we randomly sample 500 instances from the validation sets of 2WikiMQA and MuSiQue". However, 2Wiki has about 12.6k validation instances, and MuSiQue has 2.4k validation instances. Sampling only 500 per dataset here is too limited
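To put the sample-size concern in numbers, the sketch below computes what fraction of each validation set 500 instances cover and a worst-case 95% margin of error. It assumes a binary per-question metric such as exact match (the metric is not stated in the excerpt above), simple random sampling without replacement, and the approximate set sizes quoted by the reviewer; `sample_stats` is a hypothetical helper, not something from the paper.

```python
import math


def sample_stats(n_sample: int, n_total: int, z: float = 1.96):
    """Return (coverage fraction, worst-case margin of error) for a sampled proportion."""
    fraction = n_sample / n_total
    # Worst case for a proportion is p = 0.5, so the variance term is 0.25 / n.
    standard_error = math.sqrt(0.25 / n_sample)
    # Finite population correction for sampling without replacement.
    fpc = math.sqrt((n_total - n_sample) / (n_total - 1))
    return fraction, z * standard_error * fpc


# Validation set sizes as quoted in the review (approximate).
for name, total in [("2WikiMQA", 12_600), ("MuSiQue", 2_400)]:
    fraction, margin = sample_stats(500, total)
    print(f"{name}: covers {fraction:.1%} of the set, margin of error about ±{margin:.1%}")
```

Under these assumptions, 500 instances cover roughly 4% of 2Wiki and 21% of MuSiQue, and a reported score carries an uncertainty of about four percentage points in the worst case, so gaps between methods that are smaller than that are hard to distinguish from sampling noise.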
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Text Readability and Simplification
