$\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space

Peihao Wang; Ruisi Cai; Zhen Wang; Hongyuan Mei; Qiang Liu; Pan Li; Zhangyang Wang

arXiv:2603.04948·cs.LG·March 6, 2026

Open Access 3 Reviews

TL;DR

The paper introduces $\nabla$-Reasoner, a novel test-time gradient-based optimization framework for LLMs that improves reasoning accuracy and efficiency by integrating differentiable optimization into decoding.

Contribution

It proposes Differentiable Textual Optimization (DTO) for on-the-fly policy refinement using gradient signals, shifting from search-based to gradient-based inference methods.

Findings

01

Achieves over 20% accuracy improvement on mathematical reasoning benchmarks.

02

Reduces model calls by 10-40% compared to baselines.

03

Provides theoretical insights linking gradient descent in sample space to reinforcement learning.

Abstract

Scaling inference-time compute for Large Language Models (LLMs) has unlocked unprecedented reasoning capabilities. However, existing inference-time scaling methods typically rely on inefficient and suboptimal discrete search algorithms or trial-and-error prompting to improve the online policy. In this paper, we propose $\nabla$-Reasoner, an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop to refine the policy on the fly. Our core component, Differentiable Textual Optimization (DTO), leverages gradient signals from both the LLM's likelihood and a reward model to refine textual representations. $\nabla$-Reasoner further incorporates rejection sampling and acceleration design to robustify and speed up decoding. Theoretically, we show that performing inference-time gradient descent in the sample space to maximize reward is…
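The idea of refining token logits with gradients from a likelihood term and a reward term can be illustrated with a toy sketch. This is not the authors' implementation: the linear per-token `reward`, the fixed `base_logprob` table standing in for the LLM's likelihood, the weight `lam`, and all function names are assumptions made for illustration, and each position is treated independently, whereas a real LLM couples positions autoregressively.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def objective(logits, reward, base_logprob, lam):
    # Toy objective: expected reward plus lam * expected base log-likelihood,
    # both taken under the soft token distribution softmax(logits[t]).
    total = 0.0
    for z_t, lq_t in zip(logits, base_logprob):
        p = softmax(z_t)
        total += sum(pv * (r + lam * lq) for pv, r, lq in zip(p, reward, lq_t))
    return total

def gradient_step(logits, reward, base_logprob, lam, lr):
    # One gradient-ascent step on the logits (returns new lists).
    # For f(z) = sum_v softmax(z)_v * c_v, df/dz_k = p_k * (c_k - sum_v p_v c_v).
    new_logits = []
    for z_t, lq_t in zip(logits, base_logprob):
        p = softmax(z_t)
        c = [r + lam * lq for r, lq in zip(reward, lq_t)]
        c_bar = sum(pv * cv for pv, cv in zip(p, c))
        grad = [pv * (cv - c_bar) for pv, cv in zip(p, c)]
        new_logits.append([z + lr * g for z, g in zip(z_t, grad)])
    return new_logits

def refine(logits, reward, base_logprob, lam=0.5, lr=1.0, steps=200):
    # Repeated first-order updates in logit space.
    for _ in range(steps):
        logits = gradient_step(logits, reward, base_logprob, lam, lr)
    return logits

def decode(logits):
    # Greedy readout: the highest-logit token at each position.
    return [max(range(len(z)), key=z.__getitem__) for z in logits]

# Hypothetical setup: 2 positions, 3-token vocabulary.
base_probs = [0.6, 0.3, 0.1]
base_logprob = [[math.log(q) for q in base_probs] for _ in range(2)]
init_logits = [row[:] for row in base_logprob]   # start from the base policy
reward = [0.0, 1.0, 0.2]                         # toy reward favours token 1

refined = refine(init_logits, reward, base_logprob)
print(decode(init_logits), decode(refined))
```

In this toy setup the base policy prefers token 0 while the reward prefers token 1, so the example shows how a first-order update on the logits can move the decoded sequence without enumerating candidates, which is the contrast with search-based methods that the abstract draws.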

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Originality:
- The shift from zeroth-order search to first-order gradient-based optimization for test-time reasoning is conceptually appealing and well-motivated by Figure 1.
- The theoretical connection between DTO and PPO via Wasserstein gradient flow (Theorem 4.1) provides an elegant unification of parametric and non-parametric inference perspectives.
- The gradient decomposition (δ_prefix, δ_postfix, δ_reward) offers clear intuition about how DTO enables bidirectional information flow along se

Weaknesses

Experimental Analysis:
- While the cost comparison uses "number of calls," actual wall-clock time could differ due to backward passes. Could you provide runtime measurements to complement the theoretical cost analysis?
- The comparison with RAP and ToT might not be entirely fair if those methods weren't given comparable computational budgets. Could you ensure all baselines use similar total compute?
- Table 1 shows ∇-Reasoner sometimes underperforms training-based GRPO (e.g., Qwen-2.5-7B on AMC:

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper is well-written, with a clear motivation.
2. The proposed method, DTO, is interesting and novel, offering a new gradient-based approach to test-time scaling, and it is supported by theoretical justification.
3. Extensive experiments show significant improvements in accuracy and efficiency.

Weaknesses

1. Lack of reward models. In Appendix D, I notice you apply different reward models for different policy models. May I ask why? Could you offer an ablation study on the choice of reward model for the same policy model?
2. The improvement statement in the abstract is overstated. The 20% accuracy improvement and the 40% reduction in computation are measured against different baselines, which causes some confusion.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The method demonstrates outstanding performance across multiple benchmarks.
2. The authors made practical considerations, enabling the proposed method to integrate well with existing LLM inference acceleration infrastructures.
3. The authors derived gradients over discrete text and attempted to provide theoretical guarantees for DTO.

Weaknesses

1. The authors introduced a hyperparameter in the objective function but did not conduct experiments to analyze its impact.
2. The proposed method heavily relies on the performance of the reward model, yet only one reward model was used in the experiments. We hope to see results under multiple reward models.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications