ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes
Shivam Kumar

TL;DR
ShapeCodeBench is a synthetic benchmark for perception-to-program reconstruction of shape scenes, evaluating models on their ability to generate executable drawing programs from images.
Contribution
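The paper introduces a synthetic benchmark whose instances are generated from a seeded RNG, so fresh held-out sets can be created on demand. It ships a frozen eval_v1 split of 150 samples across easy, medium, and hard tiers, scored by exact match, pixel accuracy, foreground IoU, parse success, and execution success.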
Findings
A classical computer-vision heuristic is competitive on easy scenes but collapses when overlapping shapes fuse into a single component.
GPT-5.5 achieves the highest exact match among tested models.
The benchmark remains challenging, with low overall exact match scores.
Abstract
We introduce ShapeCodeBench, a synthetic benchmark for perception-to-program reconstruction: given a rendered raster image, a model must emit an executable drawing program that a deterministic evaluator re-renders and compares with the target. The v1 DSL has four primitives on a 512 x 512 black-on-white canvas, but every instance is generated from a seeded RNG, so fresh held-out sets can be created to reduce exact-instance contamination. We release a frozen eval_v1 split with 150 samples across easy, medium, and hard tiers, scored by exact match, pixel accuracy, foreground IoU, parse success, and execution success. We evaluate an empty-program floor, a classical computer-vision heuristic, Claude Opus 4.7 at high and max effort, and GPT-5.5 at medium and extra_high reasoning effort. The heuristic is competitive on easy scenes but collapses when overlaps fuse components; the strongest…
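The seeded generation and the pixel-level metrics named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the benchmark's actual code: the abstract does not name the four DSL primitives, so the sketch uses only hypothetical axis-aligned rectangles, a 64 x 64 grid instead of 512 x 512, and invented function names (`generate_scene`, `render`, `score`). Treating IoU as 1.0 when both images are empty is also an assumed convention.

```python
import random


def generate_scene(seed, n_shapes=3, size=64):
    """Sample a hypothetical scene: a list of rectangles (x0, y0, x1, y1)."""
    # A local Random instance makes the scene a pure function of the seed,
    # so a new seed yields a fresh instance and an old seed reproduces one.
    rng = random.Random(seed)
    scene = []
    for _ in range(n_shapes):
        x0 = rng.randrange(0, size - 8)
        y0 = rng.randrange(0, size - 8)
        w = rng.randrange(4, 9)
        h = rng.randrange(4, 9)
        scene.append((x0, y0, x0 + w, y0 + h))
    return scene


def render(scene, size=64):
    """Rasterize to a size x size grid; True marks a foreground (black) pixel."""
    grid = [[False] * size for _ in range(size)]
    for x0, y0, x1, y1 in scene:
        for y in range(y0, min(y1, size)):
            for x in range(x0, min(x1, size)):
                grid[y][x] = True
    return grid


def score(pred, target):
    """Compare two rendered grids of equal size."""
    total = agree = inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += 1
            agree += int(p == t)
            inter += int(p and t)
            union += int(p or t)
    return {
        "exact_match": agree == total,
        "pixel_accuracy": agree / total,
        # Assumed convention: two empty images count as a perfect IoU.
        "foreground_iou": inter / union if union else 1.0,
    }


if __name__ == "__main__":
    target = render(generate_scene(seed=7))
    empty_program = render([])  # analogue of the empty-program floor
    print(score(target, target))
    print(score(empty_program, target))
```

In this toy setting an empty program scores zero foreground IoU while still reaching high pixel accuracy, because most of the canvas is background. That gap is one plausible reason to report IoU and exact match alongside pixel accuracy.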
