Evaluating Prompting and Execution-Based Methods for Deterministic Computation in LLMs

Hongkun Yu

arXiv:2605.03227·cs.AI·May 8, 2026

TL;DR

This paper systematically evaluates prompting strategies for exact, deterministic computation in LLMs. It finds that generating programs and running them in an external interpreter yields perfect accuracy on its synthetic tasks, whereas standard prompting methods reach only moderate accuracy.

Contribution

It introduces a synthetic dataset with diverse natural language instructions for controlled evaluation, and it shows that Program-of-Thought prompting with an external interpreter outperforms the other methods.

Findings

01

Program-of-Thought achieves perfect accuracy on synthetic tasks.

02

Standard prompting methods achieve only moderate accuracy on sequence-based tasks.

03

Training a small domain-specific model yields reliable program generation.

Abstract

Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning. However, their ability to perform exact, deterministic computation remains unclear. In this work, we systematically evaluate multiple prompting strategies, including Chain-of-Thought (CoT), Least-to-Most decomposition, Program-of-Thought (PoT), and Self-Consistency (SC), on tasks requiring precise and error-free outputs, including binary counting, longest substring detection, and arithmetic evaluation. To support this study, we introduce a synthetic dataset with diverse natural language instructions, enabling controlled evaluation of exact computation across multiple task types. Our results show that standard prompting methods achieve only moderate accuracy on sequence-based tasks. CoT provides limited improvement, while Least-to-Most suffers from error accumulation. In…
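
The Program-of-Thought pattern named in the abstract can be illustrated with a minimal sketch: the model writes a short program, and an external interpreter, not the model, computes the answer. Everything below is an assumption made for illustration and is not taken from the paper: the helper `run_generated_program`, the convention that the program stores its result in a variable called `answer`, the three hard-coded program strings standing in for model output, and the specific reading of each task (counting 1s in a bit string, length of the longest run of one repeated character, and a fixed arithmetic expression).

```python
# Hypothetical sketch of Program-of-Thought execution. No model is called here;
# the program strings below stand in for text a model might generate.

# Small allow-list of builtins for the generated code. This limits accidents
# only; it is NOT a security sandbox.
ALLOWED_BUILTINS = {"len": len, "max": max, "min": min, "range": range, "sum": sum}


def run_generated_program(source, inputs):
    """Execute generated code in a fresh namespace and return its `answer`.

    `source` is the program text; `inputs` maps variable names to task inputs.
    The `answer` variable convention is an assumption of this sketch.
    """
    env = {"__builtins__": ALLOWED_BUILTINS}
    env.update(inputs)
    exec(source, env)  # the interpreter does the exact computation
    return env["answer"]


# Stand-in for a generated program: count the 1s in a bit string.
COUNT_ONES = 'answer = bits.count("1")'

# Stand-in for a generated program: longest run of one repeated character.
LONGEST_RUN = """
best = 0
cur = 0
prev = None
for ch in s:
    cur = cur + 1 if ch == prev else 1
    best = max(best, cur)
    prev = ch
answer = best
"""

# Stand-in for a generated program: evaluate an arithmetic expression.
ARITHMETIC = "answer = (3 + 4) * 5 - 6"

ones = run_generated_program(COUNT_ONES, {"bits": "1011001"})
run_length = run_generated_program(LONGEST_RUN, {"s": "aabbbbc"})
value = run_generated_program(ARITHMETIC, {})
print(ones, run_length, value)
```

In this arrangement the model only has to produce a correct program; counting, scanning, and arithmetic are delegated to the interpreter, which is deterministic by construction. A real pipeline would also need to handle generated code that fails to parse, raises an exception, or never sets `answer`.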
