Beyond Templates: Dynamic Adaptation of Reasoning Demonstrations via Feasibility-Aware Exploration

Yong Wu; Weihang Pan; Ke Li; Chen Binhui; Ping Li; Binbin Lin

arXiv:2505.20700·cs.CL·May 28, 2025

TL;DR

This paper introduces DART, a framework that dynamically adapts reasoning demonstrations for small language models by selectively imitating expert steps and exploring alternative reasoning paths, improving reasoning ability and data efficiency.

Contribution

DART is a novel data adaptation method that aligns reasoning trajectories with small language models' capabilities through selective imitation and autonomous exploration.

Findings

01. DART significantly improves reasoning performance across benchmarks.

02. It enhances data efficiency compared to static fine-tuning.

03. DART generalizes well across different model scales.

Abstract

Large language models (LLMs) have shown remarkable reasoning capabilities, yet aligning such abilities to small language models (SLMs) remains a challenge due to distributional mismatches and limited model capacity. Existing reasoning datasets, typically designed for powerful LLMs, often lead to degraded performance when directly applied to weaker models. In this work, we introduce Dynamic Adaptation of Reasoning Trajectories (DART), a novel data adaptation framework that bridges the capability gap between expert reasoning trajectories and diverse SLMs. Instead of uniformly imitating expert steps, DART employs a selective imitation strategy guided by step-wise adaptability estimation via solution simulation. When expert steps surpass the student's capacity -- signaled by an Imitation Gap -- the student autonomously explores alternative reasoning paths, constrained by outcome…
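The selective-imitation loop described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the names `adaptability`, `adapt_trajectory`, `toy_rollout`, and `toy_explore`, the threshold of 0.5, the rollout count of 8, and the choice to drop an example when exploration finds no outcome-consistent path are all hypothetical, since the abstract is truncated before those details appear.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interfaces (not from the paper):
# a rollout completes a solution from a step prefix and returns a final answer;
# an explorer proposes alternative continuations as (steps, final_answer) pairs.
Rollout = Callable[[str, Sequence[str]], str]
Explore = Callable[[str, Sequence[str]], List[Tuple[List[str], str]]]


def adaptability(question: str, prefix: Sequence[str], rollout: Rollout,
                 gold: str, n_rollouts: int = 8) -> float:
    """Fraction of simulated completions from `prefix` that reach `gold`."""
    hits = sum(rollout(question, prefix) == gold for _ in range(n_rollouts))
    return hits / n_rollouts


def adapt_trajectory(question: str, expert_steps: Sequence[str], gold: str,
                     rollout: Rollout, explore: Explore,
                     threshold: float = 0.5,
                     n_rollouts: int = 8) -> Optional[List[str]]:
    """Imitate expert steps while the student can still finish from them.

    When a step's estimated adaptability falls below `threshold` (treated
    here as the imitation gap), switch to student exploration and accept
    the first continuation whose final answer matches `gold`.
    """
    kept: List[str] = []
    for step in expert_steps:
        score = adaptability(question, kept + [step], rollout, gold, n_rollouts)
        if score >= threshold:
            kept.append(step)
            continue
        for steps, answer in explore(question, kept):
            if answer == gold:  # outcome-consistency check
                return kept + steps
        return None  # assumption: drop the example if exploration fails
    return kept


def toy_rollout(question: str, prefix: Sequence[str]) -> str:
    """Toy student: fails whenever the prefix contains a 'hard' step."""
    return "wrong" if any("hard" in s for s in prefix) else "42"


def toy_explore(question: str, prefix: Sequence[str]) -> List[Tuple[List[str], str]]:
    """Toy explorer: one wrong detour, then a simpler correct path."""
    return [(["detour"], "41"), (["easy a", "easy b"], "42")]
```

With the toy stand-ins, calling `adapt_trajectory("q", ["s1", "hard s2", "s3"], "42", toy_rollout, toy_explore)` keeps `s1`, rejects `hard s2`, and splices in the explored steps `easy a` and `easy b`. In practice the rollout and explorer would be samples from the student model.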

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 4·Confidence 4

Strengths

The paper is original in reframing process supervision as feasibility-aware selective imitation with outcome-consistent exploration, dynamically adapting expert trajectories to a student’s capacity rather than uniformly copying steps. This formulation is clearly introduced with a concrete problem setup and algorithms, and the training signal is motivated to match the student’s own history, improving robustness. Empirically, the study evaluates across multiple math-reasoning datasets and several

Weaknesses

The paper’s novelty over existing selective imitation, verifier or PRM-guided process supervision, and rejection-sampling fine-tuning is unclear; a head-to-head with these and strong test-time scaling baselines would clarify incremental value. The method relies on answer-checkable math tasks and narrow datasets, so generalization to open-ended outputs is uncertain; broaden beyond math and provide failure analyses and significance tests. Ablations are missing on key knobs that drive cost and beha

Reviewer 02·Rating 2·Confidence 3

Strengths

### 1. **Insight on The Imitation Gap**

The paper identifies and empirically validates the **“Imitation Gap”**, a critical phenomenon where continued imitation of expert reasoning harms small models once their cognitive limits are exceeded. This insight explains failure cases of previous imitation-based fine-tuning and offers a theoretically grounded basis for adaptive learning.

### 2. **Clarity of Methodology**

The authors provide a clear step-by-step formalization of DART’s simulation, adap

Weaknesses

### 1. **Unclear Reliability of the “Imitation Feasibility” Estimate**

The core mechanism (step-wise adaptability estimation via solution simulation) hinges on Monte Carlo rollouts to decide whether a model can imitate a given reasoning step. However, the paper provides no rigorous validation of this signal’s reliability or stability across datasets and model sizes. Adaptability scores might be noisy, dataset-specific, or overly sensitive to sampling temperature and rollout randomness, undermi
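The noise concern raised here can be quantified with standard binomial arithmetic. The sketch below is not from the paper or the review; it assumes each rollout is an independent success/failure trial with true success probability `p`, and the function names and the idea of a fixed decision threshold are illustrative assumptions.

```python
import math


def rollout_standard_error(p: float, n_rollouts: int) -> float:
    """Standard error of a success-rate estimate from n independent rollouts."""
    return math.sqrt(p * (1.0 - p) / n_rollouts)


def prob_below_threshold(p: float, n_rollouts: int, threshold: float) -> float:
    """Probability that the empirical success rate falls below `threshold`
    when the true per-rollout success probability is `p` (exact binomial sum)."""
    return sum(
        math.comb(n_rollouts, k) * p**k * (1.0 - p) ** (n_rollouts - k)
        for k in range(n_rollouts + 1)
        if k / n_rollouts < threshold
    )
```

For a step whose true success rate sits just above the decision threshold, `prob_below_threshold` gives the chance that a small rollout budget labels it infeasible anyway, which is the kind of instability the reviewer asks the authors to measure.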

Reviewer 03·Rating 2·Confidence 4

Strengths

- The problem of distribution mismatch between a teacher and the student models is important to address and this work is focusing on an important research question. The research idea is well presented in the paper and it is well structured and easy to understand. Although the related work section is a bit incomplete, other areas seem to be well presented.

- Results use models of different sizes to study the scaling effect of the approach and the results are presented on a long list of mathemati

Weaknesses

- The experiments are quite limited, with nearly all baselines missing. Dynamic data collection (which trajectories to choose, and when) has been studied a lot in the past. Comparisons with other baselines are needed to prove the point. For example, a similar problem is tackled in the SIKeD[1] paper where the authors discuss the exact same things, use very similar models and datasets and argue that using a different reasoning chain can assist in learning better and when to use student generation

Reviewer 04·Rating 2·Confidence 4

Strengths

1. Interesting idea: The paper introduces the "imitation gap." It finds the exact point where a small model gets confused by an expert's complex example and then smartly lets the small model find its own, simpler way to the right answer. The "imitation gap" transforms a low-quality, harmful training example into a high-quality, customized one. Instead of just discarding a hard example entirely, DART uses the example in a new way.

2. Easy to Read: The paper is well-written and uses a simple Figure 1

Weaknesses

1. Baselines: The authors didn't compare DART to a much simpler, cheaper alternative, such as finding and removing all the hard examples at the start. For example, use only pass@8 > 1 trajectories and remove those with pass@8==0. This would also easily remove questions that are too "hard", because "medium" examples are easy to learn. The paper only compares to "raw" datasets, which only proves it works better than using the "teacher" solutions directly.

2. Evaluation datasets: it only proves the DART me
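The filtering baseline suggested in point 1 can be sketched in a few lines. This is a reading of the reviewer's proposal, not something the paper evaluates: it assumes the intent is to keep a question when at least one of the student's first 8 samples is correct and to drop it when none are. The function name and the input layout (question id mapped to a list of per-sample correctness flags) are hypothetical.

```python
from typing import Dict, List


def filter_by_pass_at_k(results: Dict[str, List[bool]], k: int = 8) -> List[str]:
    """Return ids of questions the student solved at least once in its
    first k samples; questions with zero successes are dropped."""
    return [qid for qid, outcomes in results.items() if any(outcomes[:k])]


# Toy data: per-question correctness of 8 student samples.
toy_results = {
    "too_hard": [False] * 8,
    "medium": [False] * 7 + [True],
    "easy": [True] * 8,
}
kept_ids = filter_by_pass_at_k(toy_results)
```

On the toy data only `too_hard` is removed. Such a filter needs one round of student sampling per question and no step-level simulation, which is why the reviewer frames it as a cheaper point of comparison.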

Taxonomy

Topics: Semantic Web and Ontologies · Data Stream Mining Techniques · Logic, Reasoning, and Knowledge