PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models

Ye Yu; Yaoning Yu; Haohan Wang

arXiv:2506.10716·cs.CL·June 13, 2025

Open Access · 3 Reviews

TL;DR

PREMISE is a prompt-only framework that significantly reduces reasoning trace length and cost in large reasoning models while maintaining high accuracy, enabling more efficient deployment in resource-constrained settings.

Contribution

It introduces a novel prompt optimization method that minimizes reasoning verbosity and cost without altering model weights, applicable directly to commercial large language models.

Findings

01. Reduces reasoning tokens by up to 87.5%

02. Maintains or improves accuracy on mathematical benchmarks

03. Cuts dollar cost of inference by 69-82%

Abstract

Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve strong performance on mathematical benchmarks using lengthy chain-of-thought (CoT) reasoning, but the resulting traces are often unnecessarily verbose. This inflates token usage and cost, limiting deployment in latency-sensitive or API-constrained settings. We introduce PREMISE (PRompt-based Efficient Mathematical Inference with Strategic Evaluation), a prompt-only framework that reduces reasoning overhead without modifying model weights. PREMISE combines trace-level diagnostics with gradient-inspired prompt optimization to minimize redundant computation while preserving answer accuracy. The approach jointly optimizes brevity and correctness through a multi-objective textual search that balances token length and answer validity. Unlike prior work, PREMISE runs in a single-pass black-box interface, so it can be…
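The trade-off between brevity and correctness described in the abstract can be illustrated with a simple scalarized score. The sketch below is not the paper's objective or its TextGrad-style search: the `Trial` record, the `length_weight` and `token_budget` parameters, and the exact-match check are all assumptions made here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One model response under a given prompt (hypothetical record)."""
    answer: str        # final answer extracted from the response
    reference: str     # ground-truth answer
    trace_tokens: int  # length of the reasoning trace, in tokens


def score_prompt(trials, length_weight=0.5, token_budget=1000):
    """Mean accuracy minus a weighted, capped length penalty.

    Both length_weight and token_budget are illustrative assumptions.
    """
    if not trials:
        raise ValueError("need at least one trial")
    accuracy = sum(
        t.answer.strip() == t.reference.strip() for t in trials
    ) / len(trials)
    mean_tokens = sum(t.trace_tokens for t in trials) / len(trials)
    # Cap the penalty at 1.0 so very long traces cannot dominate without bound.
    length_penalty = min(mean_tokens / token_budget, 1.0)
    return accuracy - length_weight * length_penalty


def pick_best(candidates, **kwargs):
    """Return the name of the candidate prompt with the highest score."""
    return max(candidates, key=lambda name: score_prompt(candidates[name], **kwargs))


# Made-up trials for three hypothetical prompts.
candidates = {
    "verbose": [Trial("42", "42", 800), Trial("7", "7", 1200)],
    "concise": [Trial("42", "42", 100), Trial("7", "7", 300)],
    "terse_wrong": [Trial("41", "42", 50), Trial("7", "7", 50)],
}
best = pick_best(candidates)  # "concise": same accuracy as "verbose", far fewer tokens
```

With these made-up trials, the concise prompt wins because it matches the verbose prompt's accuracy at a fraction of the trace length, while the shortest prompt loses because its accuracy drops. A single weighted sum is only one way to combine the two objectives; the weight controls how much accuracy one is willing to trade for shorter traces.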

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

* The framework reduces reasoning tokens by up to 84.3 percent, which cuts monetary cost by as much as 82.2 percent. This makes large reasoning models (LRMs) more viable for large-scale, cost-sensitive deployment.
* PREMISE is a prompt-only method that requires no modification of the model's weights. This is crucial for deployment with commercial, proprietary LRM APIs, where access to model internals is restricted.
* It employs a sophisticated multi-objective tex…

Weaknesses

* The evaluation primarily focuses on relatively straightforward math datasets like GSM8K and SVAMP. The optimized prompt, which heavily emphasizes brevity and calculation-only output (A.3), may struggle with complex problems (like AIME or HMMT) that genuinely require and benefit from detailed, longer Chain-of-Thought reasoning.
* The core method is based on TextGrad, which has been previously established for prompt optimization. While the application is novel for efficiency, the paper's optimi…

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The framework is interesting, and the results show promising performance.

Weaknesses

This paper presents score improvements; however, it lacks technical details as well as baselines. I do not think it meets the high standard of this conference.

1. I cannot find a workflow or systematic presentation of the method, so it is hard for me to grasp the technical details and contributions. The authors should include such a figure in the manuscript. I also do not believe this is a theory paper.
2. The baseline is not well constructed.
   1. The authors only consider…

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- Prompt-based; does not require access to the model's weights.
- Cuts reasoning tokens by 84.3% and dollar cost by 82.2% while reaching 94.7% accuracy.

Weaknesses

## Major

- Line 160: $I(r, q)$ is not clearly defined. How is this metric calculated? The paper only mentions that it measures the deviation. Is this term differentiable? How is it defined?
- Line 163: Can you explain more about the "shortest known correct" trace? How can this be known without any prior knowledge?
- Can you report results without any thinking tokens, as in standard IO prompting?
- Have you tried other benchmarks in math or logic? My impression of model performances on G…

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Topic Modeling · Machine Learning in Materials Science