CrochetBench: Can Vision-Language Models Move from Describing to Doing in Crochet Domain?
Peiyu Li, Xiaobao Huang, Ting Hua, Nitesh V. Chawla

TL;DR
CrochetBench evaluates vision-language models' ability to generate precise, executable crochet procedures, revealing significant gaps in their reasoning and synthesis capabilities for real-world creative tasks.
Contribution
This paper introduces CrochetBench, a novel benchmark for assessing multimodal models' procedural reasoning in crochet using a new DSL and execution-based validation.
Findings
Performance drops from surface similarity to executable correctness
Models struggle with long-range symbolic reasoning
Limitations in 3D-aware procedural synthesis
Abstract
While multimodal large language models can describe visual content, their ability to generate executable procedures remains underexplored. CrochetBench, presented in this paper, evaluates this shift from describing to doing through fine-grained procedural reasoning in crochet: models must recognize stitches, select structurally appropriate instructions, and generate compilable procedures. We adopt the CrochetPARADE DSL as our intermediate representation, enabling structural validation and functional evaluation via execution. The benchmark covers stitch classification, instruction grounding, and both natural-language-to-DSL and image-to-DSL translation. Across all tasks, performance decreases sharply as the evaluation shifts from surface-level similarity to executable correctness, revealing limitations in long-range symbolic reasoning and 3D-aware procedural synthesis. Our…
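The contrast between surface-level similarity and executable correctness can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration: the pattern strings use a made-up bracket notation, `toy_compiles` is a hypothetical stand-in that only checks bracket balance (it is not the CrochetPARADE compiler or grammar), and `difflib` similarity stands in for BLEU/ROUGE-style metrics.

```python
from difflib import SequenceMatcher


def surface_similarity(pred: str, ref: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for BLEU/ROUGE-style scores."""
    return SequenceMatcher(None, pred, ref).ratio()


def toy_compiles(program: str) -> bool:
    """Hypothetical validity check: brackets must be balanced and properly nested.

    This is NOT the CrochetPARADE grammar, only a placeholder for a real compiler.
    """
    closers = {")": "(", "]": "["}
    stack = []
    for ch in program:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


def evaluate(pairs):
    """Return mean surface similarity and compile rate over (prediction, reference) pairs."""
    n = len(pairs)
    mean_sim = sum(surface_similarity(p, r) for p, r in pairs) / n
    compile_rate = sum(toy_compiles(p) for p, _ in pairs) / n
    return {"mean_similarity": mean_sim, "compile_rate": compile_rate}


# Made-up examples: the second prediction differs from its reference by one character,
# yet it no longer passes the validity check.
examples = [
    ("[sc, sc, inc]*6", "[sc, sc, inc]*6"),
    ("[sc, sc, inc*6", "[sc, sc, inc]*6"),
]
report = evaluate(examples)
print(report)
```

In this toy setting a single missing bracket barely changes the similarity score but halves the compile rate, which is the kind of gap the abstract describes.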
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. Execution-grounded evaluation. CrochetPARADE enables syntactic/structural validation and visualization/execution, providing a more faithful signal than BLEU/ROUGE alone.
2. Well-structured task ladder. Tasks escalate from perception to executable synthesis with clear metrics and sizes.
3. Dataset scale & coverage. Table 1 reports 6,085 patterns, 98.77% image coverage, 55 project types.
4. Clear gap at execution. Performance “declines as evaluation shifts to executable correctness”, project-le
1. Semantic equivalence vs. compilation. Compilation checks syntax/structure but can miss semantically equivalent programs. The authors motivate execution-based metrics but do not pair them with visual render agreement in main results.
2. More qualitative results are needed for better interpretation of the results.
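The first point above, that textually different programs can be semantically equivalent, can be made concrete with a small sketch. The row notation here (comma-separated tokens with an optional repeat count, such as `3sc`) is invented for illustration and is not CrochetPARADE syntax; `expand` and `equivalent` are hypothetical helper names.

```python
import re


def expand(row: str) -> list[str]:
    """Expand a toy row like '3sc, inc' into a flat stitch list ['sc', 'sc', 'sc', 'inc'].

    The notation is an assumption made for this example, not a real DSL.
    """
    stitches = []
    for token in row.split(","):
        token = token.strip()
        if not token:
            continue
        match = re.fullmatch(r"(\d*)([a-z]+)", token)
        if match is None:
            raise ValueError(f"unrecognized token: {token!r}")
        count = int(match.group(1)) if match.group(1) else 1
        stitches.extend([match.group(2)] * count)
    return stitches


def equivalent(a: str, b: str) -> bool:
    """Two rows are treated as equivalent if they expand to the same stitch sequence."""
    return expand(a) == expand(b)


compact = "3sc, inc"
verbose = "sc, sc, sc, inc"
print(compact == verbose)            # the raw strings differ
print(equivalent(compact, verbose))  # the expanded stitch sequences match
```

A string-level or compile-only check would not relate these two rows, whereas comparing a normalized expansion (or, for real patterns, the rendered output) would.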
- The authors introduce an interesting task built on the CrochetPARADE DSL, whose outputs are verifiable; this property could benefit other uses, for example post-training RL.
- I can see the challenge of crochet code generation: it requires 3D-aware reasoning, and because the task is quite niche, current language models may not have been trained on it, making it less contaminated and possibly a better reflection of differences in model performance.
- The authors emphasize that their benchmark focuses on instruction fidelity, i.e., whether a model can generate valid, compilable DSL code from multimodal input, and that it opens a new direction for multimodal research. I don't think they are the first to do this: for example, the whole area of having LLMs and multimodal LLMs generate symbolic graphics programs such as SVG (2D) and CAD (3D), which fulfill all the requirements and properties of this crochet DSL, has already been studied befor
- It is a new approach to employ crochet, a craft defined by its intricate structure and creativity, as a framework for evaluating a model's reasoning and code generation capabilities.
- This benchmark highlights the limitations of existing vision-language models.
- The data, sourced only from the Yarnspirations website, may be biased toward specific design styles or formats, limiting its diversity and representativeness.
- It is unclear whether the use of GPT-4o-mini for PDF conversion and annotation involved any manual error checking.
- The evaluation for Task D focuses only on compilation success, without comparing the geometric and topological similarity between the compiled output and the reference design.
- There is a progressive relationship betwe
The benchmark built around the crochet domain is conceptually interesting and could potentially introduce new challenges for VLMs.
I believe this paper has significant issues in its presentation, which make it hard to follow and understand, and therefore difficult to assess its contribution.
- It is hard to follow most parts of the paper. There are no examples or figures to help readers understand what the benchmark is assessing. Given that this is a highly specialized domain (crochet), many readers from the ICLR community may not have the relevant background knowledge. The only section that can somehow know the backgroun
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Data Visualization and Analytics
