Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models
Mousa Salah, Amgad Muneer

TL;DR
This study systematically evaluates how sampling temperature affects the performance of different prompting strategies in extended reasoning large language models, and finds that the best temperature depends on the prompting strategy.
Contribution
It provides the first comprehensive analysis of temperature effects on prompting strategies in extended reasoning LLMs, highlighting the importance of optimizing temperature and prompting strategy jointly.
Findings
Zero-shot prompting peaks at moderate temperatures (T=0.4, 0.7) with 59% accuracy.
Chain-of-thought prompting performs best at the temperature extremes (T=0.0 and T=1.0).
The benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0.
Abstract
Extended reasoning models represent a transformative shift in Large Language Model (LLM) capabilities by enabling explicit test-time computation for complex problem solving. However, the optimal configuration of sampling temperature and prompting strategy for these systems remains largely underexplored. We systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. We find that zero-shot prompting achieves peak performance at moderate temperatures, reaching 59% accuracy at T=0.4 and T=0.7, while chain-of-thought prompting performs best at the temperature extremes. Most notably, the benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0. These results suggest…
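The evaluation design described in the abstract (two prompting strategies crossed with four temperatures, scored by accuracy) can be sketched as a small grid loop. This is a minimal illustration, not the authors' code: the `solve` callable, the `(question, answer)` problem format, exact-match grading, and the definition of the reasoning benefit as a ratio of accuracies are all assumptions made for the example.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

TEMPERATURES = [0.0, 0.4, 0.7, 1.0]             # the four settings named in the abstract
STRATEGIES = ["zero_shot", "chain_of_thought"]  # the two prompting strategies compared

# Hypothetical problem record: (question text, reference answer).
Problem = Tuple[str, str]
# Hypothetical solver signature: (question, strategy, temperature) -> model answer.
Solver = Callable[[str, str, float], str]


def evaluate_grid(problems: List[Problem], solve: Solver) -> Dict[Tuple[str, float], float]:
    """Return accuracy for every (strategy, temperature) cell of the grid."""
    results = {}
    for strategy, temperature in product(STRATEGIES, TEMPERATURES):
        # Assumption: an answer counts as correct on exact string match.
        correct = sum(
            solve(question, strategy, temperature).strip() == answer.strip()
            for question, answer in problems
        )
        results[(strategy, temperature)] = correct / len(problems)
    return results


def reasoning_gain(acc_with: float, acc_without: float) -> float:
    """Assumed definition: accuracy with extended reasoning divided by accuracy without."""
    if acc_without == 0:
        raise ValueError("baseline accuracy is zero; ratio undefined")
    return acc_with / acc_without
```

With 39 problems and a single run per cell, each cell's accuracy moves in steps of 1/39 (about 2.6 percentage points), so a reported 59% would be consistent with 23 of 39 problems solved.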
