Programming Puzzles
Tal Schuster, Ashwin Kalyan, Oleksandr Polozov, Adam Tauman Kalai

TL;DR
This paper introduces programming puzzles as a new objective benchmark for program synthesis and releases an open-source dataset with baseline solvers, whose success rates vary with problem difficulty.
Contribution
It presents a comprehensive dataset of Python programming puzzles together with baseline solvers, enabling evaluation of program synthesis methods without relying on natural language.
Findings
Codex solves up to 18% of problems in one try
Baseline solvers improve with multiple attempts
Puzzle difficulty for humans correlates with puzzle difficulty for AI solvers
Abstract
We introduce a new type of programming challenge called programming puzzles, as an objective and comprehensive evaluation of program synthesis, and release an open-source dataset of Python Programming Puzzles (P3). Each puzzle is defined by a short Python program f, and the goal is to find an input which makes f return True. The puzzles are objective in that each one is specified entirely by the source code of its verifier f, so evaluating f is all that is needed to test a candidate solution. They do not require an answer key or input/output examples, nor do they depend on natural language understanding. The dataset is comprehensive in that it spans problems of a range of difficulties and domains, ranging from trivial string manipulation problems, to classic programming puzzles (e.g., Tower of Hanoi), to interview/competitive-programming problems (e.g., dynamic programming), to…
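To make the puzzle format concrete, here is a minimal sketch of a verifier and a candidate solution. The verifier is called f here; the specific puzzle, the candidate function, and the is_solved helper are hypothetical illustrations and are not taken from the P3 dataset.

```python
def f(x: str) -> bool:
    """Hypothetical verifier: True iff x is a 6-character palindrome starting with 'ab'."""
    return len(x) == 6 and x == x[::-1] and x.startswith("ab")


def candidate() -> str:
    """A hand-written candidate answer (a real solver would have to synthesize this)."""
    return "abccba"


def is_solved(verifier, answer) -> bool:
    """Checking a candidate needs only the verifier: no answer key, no I/O examples."""
    return verifier(answer) is True


if __name__ == "__main__":
    print(is_solved(f, candidate()))
    print(is_solved(f, "abcdef"))
```

Because the check is just a call to the verifier, any input that makes it return True counts as a valid solution, regardless of how it was found.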
Videos
[ML News] Hugging Face course | GAN Theft Auto | AI Programming Puzzles | PyTorch 1.9 Released · youtube
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Teaching and Learning Programming
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Weight Decay · Multi-Head Attention · Dropout · Byte Pair Encoding · Adam · Cosine Annealing
