Sketch-and-Verify: Structured Inference-Time Scaling via Program Sketching
Shan Jiang, Zijian Yi, Chenguang Zhu

TL;DR
Sketch-and-Verify is a structured inference-time scaling method that uses program sketching to explore diverse algorithmic strategies efficiently. It outperforms flat sampling within the same computational budget.
Contribution
The paper presents a sketching-based approach to inference-time scaling in which each sketch targets a distinct algorithmic strategy, and demonstrates its effectiveness on the HumanEval+ benchmark.
Findings
Sketching outperforms flat sampling at matched candidate counts.
Cross-tier sketching complements tier upgrades but does not replace them.
Practitioners should use sketching when stronger tiers are unavailable.
Abstract
SKETCHVERIFY is a within-tier cost-performance policy, not a universal accuracy improvement. The operational question is this: when a practitioner is stuck with a small, cheap code model (here, Gemini 3.1 Flash Lite) for latency, deployment, or budget reasons, how should they spend a small amount of extra test-time compute? SKETCHVERIFY factorizes the search space: the LLM enumerates K distinct algorithmic strategies, writes a program sketch for each (a partial program with ?? holes), and fills each sketch M times, producing K × M structurally diverse candidates that are verified by execution and selected by fingerprint clustering. Each extra sketch is guaranteed to explore a different algorithm; each extra flat sample likely duplicates an existing one. Our central evidence is a cost-quality Pareto plot on HumanEval+ across three Gemini tiers (Lite, Flash, Pro), and a reanalysis of the 19 problems…
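The selection step described in the abstract (execute the K × M candidates, then pick by fingerprint clustering) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and is not taken from the paper: the helper names `fingerprint` and `select_by_fingerprint` are hypothetical, a fingerprint is taken to be the tuple of outputs on a few probe inputs, the largest cluster is assumed to win, and the LLM-filled sketches are replaced by three hand-written toy strategies for summing a list.

```python
from collections import defaultdict


def fingerprint(program, probe_inputs):
    """Hypothetical behavioural fingerprint: outputs (or error type) per probe input."""
    outputs = []
    for x in probe_inputs:
        try:
            outputs.append(repr(program(x)))
        except Exception as exc:  # a crash is part of the observed behaviour
            outputs.append("error:" + type(exc).__name__)
    return tuple(outputs)


def select_by_fingerprint(candidates, probe_inputs):
    """Group candidates by fingerprint; return one member of the largest cluster.

    Assumption: majority behaviour wins. The paper's actual rule may differ.
    """
    clusters = defaultdict(list)
    for program in candidates:
        clusters[fingerprint(program, probe_inputs)].append(program)
    largest = max(clusters.values(), key=len)
    return largest[0], dict(clusters)


# Toy stand-ins for K = 3 strategies; a real system would have an LLM fill sketches.
def sum_builtin(xs):
    return sum(xs)


def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total


def sum_buggy(xs):
    return sum(xs[:-1])  # deliberately drops the last element


K, M = 3, 2
# Each strategy is "filled" M times; in this toy the fills are identical copies.
candidates = [f for f in (sum_builtin, sum_loop, sum_buggy) for _ in range(M)]
probes = [[1, 2, 3], [], [5]]

chosen, clusters = select_by_fingerprint(candidates, probes)
print(len(candidates), "candidates in", len(clusters), "behavioural clusters")
print("chosen output on [1, 2, 3]:", chosen([1, 2, 3]))
```

In this toy, the two correct strategies land in the same cluster even though they are structurally different, which is the property that lets behavioural agreement across distinct algorithms act as a selection signal.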
