Evaluating Prompting and Execution-Based Methods for Deterministic Computation in LLMs
Hongkun Yu

TL;DR
This paper systematically evaluates prompting strategies for exact computation in LLMs and finds that generating programs and running them in an external interpreter yields perfect accuracy on its synthetic tasks, whereas standard prompting reaches only moderate accuracy.
Contribution
It introduces a synthetic dataset for controlled evaluation and shows that Program-of-Thought prompting with an external interpreter outperforms the other prompting methods evaluated.
Findings
Program-of-Thought achieves perfect accuracy on synthetic tasks.
Standard prompting methods achieve only moderate accuracy on sequence-based tasks.
Training a small domain-specific model yields reliable program generation.
Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning. However, their ability to perform exact, deterministic computation remains unclear. In this work, we systematically evaluate multiple prompting strategies, including Chain-of-Thought (CoT), Least-to-Most decomposition, Program-of-Thought (PoT), and Self-Consistency (SC), on tasks requiring precise and error-free outputs, including binary counting, longest substring detection, and arithmetic evaluation. To support this study, we introduce a synthetic dataset with diverse natural language instructions, enabling controlled evaluation of exact computation across multiple task types. Our results show that standard prompting methods achieve only moderate accuracy on sequence-based tasks. CoT provides limited improvement, while Least-to-Most suffers from error accumulation. In…
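To make the Program-of-Thought idea concrete, here is a minimal sketch of the execution step: the model is asked to write a short program, and an external interpreter, not the model, produces the answer. Everything in it is an assumption made for illustration. The helper `run_program`, the convention that the program binds its result to `answer`, the three hand-written program strings standing in for model output, and the exact task definitions (counting 1 bits, longest run of 1s) are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of Program-of-Thought execution (not the paper's code).
# The strings below stand in for programs an LLM might emit; they are written
# by hand here so the example runs without any model.

# Only the builtins these toy programs need. This limits accidents but is
# NOT a security sandbox; untrusted code needs real isolation.
ALLOWED_BUILTINS = {"len": len, "max": max, "min": min, "range": range}


def run_program(source: str, **inputs):
    """Execute generated source and return the value it binds to `answer`."""
    namespace = {"__builtins__": ALLOWED_BUILTINS, **inputs}
    exec(source, namespace)
    return namespace["answer"]


# Assumed task: count the 1 bits in a binary string.
COUNT_ONES = 'answer = bits.count("1")'

# Assumed task: length of the longest run of consecutive 1s.
LONGEST_RUN = """
best = 0
run = 0
for ch in bits:
    run = run + 1 if ch == "1" else 0
    best = max(best, run)
answer = best
"""

# Assumed task: evaluate an arithmetic expression exactly.
ARITHMETIC = "answer = (3 + 4) * 5 - 6 // 2"

if __name__ == "__main__":
    print(run_program(COUNT_ONES, bits="1011001"))   # 4
    print(run_program(LONGEST_RUN, bits="1011001"))  # 2
    print(run_program(ARITHMETIC))                   # 32
```

The point of the design is that the interpreter is deterministic: if the generated program is right, the answer is right for any input length, so the only remaining failure mode is the model writing an incorrect program.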
