Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience

Krishna Sayana; Ketan Todi; Ambarish Jash

arXiv:2605.14443 · cs.AI · May 15, 2026

TL;DR

This paper introduces a reinforcement learning framework that trains a lightweight prompting policy for a frozen, black-box LLM, improving multi-step reasoning and tool-use performance through iterative distillation of experience.

Contribution

The paper presents an RL-based approach that optimizes prompts through experience distillation and outperforms existing methods, including evolutionary baselines such as GEPA, on the reasoning and tool-use tasks it evaluates.

Findings

01 Performance improved from 55% to 90% on logic-intensive reasoning tasks

02 Accuracy improved from 74% to 91% on tool-use tasks

03 Outperformed state-of-the-art evolutionary baselines such as GEPA

Abstract

The shift toward interacting with frozen, "black-box" Large Language Models (LLMs) has transformed prompt engineering from a heuristic exercise into a critical optimization challenge. We propose a Reinforcement Learning (RL) framework for training learned prompting policies via iterative distillation of experience. In this architecture, a lightweight prompter model is optimized to maximize task-specific rewards for a larger, frozen worker LLM. By utilizing a contrastive experience buffer that couples scalar rewards with dense textual critiques, our approach effectively amortizes iterative prompt refinement into single-shot policy weights. Our experimental analysis focuses on the Big Bench Extra Hard (BBEH) and Tau-bench suites, covering a diverse range of multi-step reasoning and tool-use tasks. We demonstrate significant gains, improving performance from 55% to 90% in logic-intensive…
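
The abstract describes a contrastive experience buffer that couples scalar rewards with textual critiques, but the excerpt here does not say how it is built. The following is only a minimal sketch of one way such a buffer could be organized, not the authors' implementation: the names `Experience`, `ContrastiveBuffer`, `contrast`, and `render_context`, the per-task capacity, and the eviction rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Experience:
    """One prompter attempt and the feedback it earned (hypothetical schema)."""
    task_id: str
    prompt: str      # prompt emitted by the lightweight prompter
    reward: float    # scalar task reward for the frozen worker's output
    critique: str    # textual critique of that output


@dataclass
class ContrastiveBuffer:
    """Illustrative buffer that keeps high- and low-reward attempts per task."""
    capacity_per_task: int = 8
    _store: Dict[str, List[Experience]] = field(default_factory=dict)

    def add(self, exp: Experience) -> None:
        bucket = self._store.setdefault(exp.task_id, [])
        bucket.append(exp)
        # Keep each bucket sorted from best to worst reward.
        bucket.sort(key=lambda e: e.reward, reverse=True)
        if len(bucket) > self.capacity_per_task:
            # Assumed eviction rule: drop a middle entry so both extremes survive.
            del bucket[len(bucket) // 2]

    def contrast(self, task_id: str) -> Optional[Tuple[Experience, Experience]]:
        """Return (best, worst) for a task, or None if there is no contrast yet."""
        bucket = self._store.get(task_id, [])
        if len(bucket) < 2 or bucket[0].reward == bucket[-1].reward:
            return None
        return bucket[0], bucket[-1]

    def render_context(self, task_id: str) -> str:
        """Format the contrastive pair as text that could be shown to a prompter."""
        pair = self.contrast(task_id)
        if pair is None:
            return ""
        best, worst = pair
        return (
            f"[better | reward={best.reward:.2f}] {best.prompt}\n"
            f"  critique: {best.critique}\n"
            f"[worse | reward={worst.reward:.2f}] {worst.prompt}\n"
            f"  critique: {worst.critique}"
        )


buf = ContrastiveBuffer(capacity_per_task=3)
buf.add(Experience("t1", "Answer directly.", 0.2, "Skipped the intermediate steps."))
buf.add(Experience("t1", "Reason step by step, then answer.", 0.9, "All steps verified."))
print(buf.render_context("t1"))
```

Pairing the best and worst attempts for the same task is one simple way to give a training step both a scalar signal and a textual explanation of what separated the two prompts; how the paper actually samples or weights experiences is not stated in this excerpt.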
