KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

Richard Sproat; Stefano Peluchetti

arXiv:2605.13322 · cs.CV · May 19, 2026

TL;DR

KamonBench is a grammar-based dataset designed to evaluate compositional factor recovery in vision-language models using synthetic Japanese crests with known factors.

Contribution

The paper introduces a benchmark of controlled synthetic data, with multiple evaluation metrics for compositional visual recognition.

Findings

01. Baseline models show varying success in factor recovery.

02. KamonBench enables detailed analysis beyond caption accuracy.

03. Controlled experiments reveal model sensitivities to specific factors.

Abstract

Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-to-structure benchmark with 20,000 synthetic composite crests and auxiliary component examples. Each composite crest is paired with a description in a formal kamon description language ("kamon yōgo"), a segmented Japanese analysis, an English translation, and a non-linguistic program code. Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed container-modifier contexts, and…
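To make the idea of program-code factor metrics concrete, the following is a minimal sketch of how per-factor accuracy could be scored alongside whole-code exact match. It is not the benchmark's own evaluation code: the `CrestCode` class, the `factor_accuracy` function, and the label strings such as `"circle"` or `"crane"` are hypothetical stand-ins, and only the three factor names (container, modifier, motif) come from the abstract.

```python
from dataclasses import dataclass

# The three factor names come from the abstract; everything else is illustrative.
FACTORS = ("container", "modifier", "motif")


@dataclass(frozen=True)
class CrestCode:
    """Hypothetical stand-in for a crest's program code, one label per factor."""
    container: str
    modifier: str
    motif: str


def factor_accuracy(gold, pred):
    """Return per-factor accuracy plus exact-match accuracy over paired codes."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    n = len(gold)
    scores = {
        factor: sum(getattr(g, factor) == getattr(p, factor)
                    for g, p in zip(gold, pred)) / n
        for factor in FACTORS
    }
    # Exact match requires all three factors to be correct at once.
    scores["exact_match"] = sum(g == p for g, p in zip(gold, pred)) / n
    return scores


# Made-up labels, chosen only to exercise the metric.
gold = [
    CrestCode("circle", "none", "crane"),
    CrestCode("circle", "shaded", "wisteria"),
    CrestCode("none", "none", "crane"),
    CrestCode("well_frame", "none", "oak"),
]
pred = [
    CrestCode("circle", "none", "crane"),
    CrestCode("circle", "none", "wisteria"),   # modifier wrong
    CrestCode("none", "none", "crane"),
    CrestCode("circle", "none", "oak"),        # container wrong
]

scores = factor_accuracy(gold, pred)
print(scores)
```

In this toy case the motif is always recovered while exact match drops to one half, which is the kind of distinction a caption-level score alone would hide.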

Code & Models

Datasets

SakanaAI/KamonBench
dataset · 123 dl