The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark
Hao Liu, Jicheng Liu

TL;DR
This paper introduces KnotBench, a comprehensive benchmark for evaluating vision-language models on diagrammatic knot reasoning tasks, revealing current models' limitations in understanding and manipulating knot structures.
Contribution
The paper provides a large, structured dataset (858,318 images from 1,951 prime-knot prototypes) and a 14-task evaluation protocol for assessing diagrammatic reasoning in VLMs, highlighting their struggles with knot structure comprehension.
Findings
Models perform poorly on knot equivalence and move prediction tasks; across all 56 (task, model) cases, 15 score at or below a random baseline.
Thinking-mode reasoning improves model accuracy but only modestly.
Current VLMs lack the ability to simulate moves on diagram features.
Abstract
A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina's canonical knot signature. Its 14 tasks span four families: equivalence judgment, move prediction, identification, and cross-modal grounding. An image-versus-symbol split locates failures along the perception-operation gap. We score Claude Opus 4.7 and GPT-5, each with and without thinking, under a 64K output-token budget matched across both vendors. Across 56 (task, model) cases, 15 sit at or below a random baseline and 8 of 14 tasks have a best score under 1.5x random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive Regina decoding recovers the knot in 0 to 4 of 100 items.…
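The headline counts in the abstract follow from simple bookkeeping: 14 tasks times 2 models times 2 thinking settings gives 56 (task, model) cases, each compared against a per-task random baseline. The Python sketch below illustrates that bookkeeping only. The function names, the dictionary layout, and the way the baseline is supplied are assumptions made for illustration, not the authors' evaluation code, and it contains no scores from the paper.

```python
from itertools import product

# Model names and task count come from the abstract; everything else
# (function names, data layout) is a hypothetical illustration.
MODELS = ["Claude Opus 4.7", "GPT-5"]
THINKING = [False, True]
N_TASKS = 14


def configurations():
    """Return every (model, thinking) pair that gets scored."""
    return list(product(MODELS, THINKING))


def summarize(scores, random_baseline, ratio=1.5):
    """Count weak cases and list weak tasks.

    scores:          {task: {(model, thinking): accuracy}}
    random_baseline: {task: accuracy of random guessing}
    Returns (number of cases at or below random,
             tasks whose best score is under ratio * random).
    """
    at_or_below = sum(
        1
        for task, per_config in scores.items()
        for accuracy in per_config.values()
        if accuracy <= random_baseline[task]
    )
    weak_tasks = [
        task
        for task, per_config in scores.items()
        if max(per_config.values()) < ratio * random_baseline[task]
    ]
    return at_or_below, weak_tasks
```

Calling `len(configurations()) * N_TASKS` gives the 56 cases quoted in the abstract. Passing the real per-task accuracies and baselines to `summarize` would yield counts of the same kind as "15 at or below random" and "8 of 14 tasks under 1.5x random".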
