The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

Hao Liu; Jicheng Liu

arXiv:2605.09900·cs.AI·May 12, 2026

TL;DR

This paper introduces KnotBench, a benchmark that evaluates vision-language models (VLMs) on diagrammatic knot reasoning, and shows that current models struggle to understand and manipulate knot structures.

Contribution

The paper contributes a large image corpus (858,318 images from 1,951 prime-knot prototypes) and an evaluation protocol for assessing diagrammatic reasoning in VLMs, highlighting how they struggle to comprehend knot structure.

Findings

01 Models perform poorly on knot equivalence and move prediction tasks.

02 Thinking-mode reasoning improves model accuracy but only modestly.

03 Current VLMs lack the ability to simulate moves on diagram features.

Abstract

A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina's canonical knot signature. Its 14 tasks span four families: equivalence judgment, move prediction, identification, and cross-modal grounding. An image-versus-symbol split locates failures along the perception-operation gap. We score Claude Opus 4.7 and GPT-5, each with and without thinking, under a 64K output-token budget matched across both vendors. Across 56 (task, model) cases, 15 sit at or below a random baseline, and 8 of 14 tasks have a best score under 1.5x random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive Regina decoding recovers the knot in 0 to 4 of 100 items.…
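
The headline counts in the abstract follow from simple bookkeeping: 14 tasks scored under four model configurations (two models, each with and without thinking) give 56 (task, model) cases. A minimal Python sketch of that tally, and of the two summary statistics the abstract reports, might look like the following. The task names, configuration labels, function names, and toy scores are all hypothetical placeholders for illustration; they are not the paper's code or data.

```python
from itertools import product

# Hypothetical labels: the abstract gives only the counts, not the task names.
TASKS = [f"task_{i:02d}" for i in range(1, 15)]           # 14 tasks
MODELS = ["claude-opus-4.7", "gpt-5"]                     # 2 models
THINKING = ["no-thinking", "thinking"]                    # 2 modes each

# Every (task, configuration) pair that gets a score.
cases = [(task, f"{model}/{mode}")
         for task, model, mode in product(TASKS, MODELS, THINKING)]
print(len(cases))  # 14 * 2 * 2 = 56


def summarize(scores, random_baseline, ratio=1.5):
    """Tally cases at or below chance and tasks whose best score is weak.

    scores: dict mapping (task, config) -> accuracy in [0, 1]
    random_baseline: dict mapping task -> chance-level accuracy
    ratio: a task counts as weak if its best score is under ratio * chance
    """
    at_or_below = sum(
        1 for (task, _), acc in scores.items() if acc <= random_baseline[task]
    )
    best = {}
    for (task, _), acc in scores.items():
        best[task] = max(best.get(task, 0.0), acc)
    weak_tasks = sorted(
        task for task, top in best.items() if top < ratio * random_baseline[task]
    )
    return at_or_below, weak_tasks


# Toy, made-up numbers purely to exercise the function.
toy_baseline = {"a": 0.5, "b": 0.25}
toy_scores = {
    ("a", "m1"): 0.5, ("a", "m2"): 0.7,
    ("b", "m1"): 0.2, ("b", "m2"): 0.5,
}
toy_at_or_below, toy_weak = summarize(toy_scores, toy_baseline)
```

With real per-task scores and chance levels substituted for the toy dictionaries, the same two numbers would correspond to the "15 of 56" and "8 of 14" figures quoted above, assuming the paper defines both statistics this way.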
