KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models
Richard Sproat, Stefano Peluchetti

TL;DR
KamonBench is a grammar-based dataset designed to evaluate compositional factor recovery in vision-language models using synthetic Japanese crests with known factors.
Contribution
It introduces a new benchmark with controlled synthetic data and multiple evaluation metrics for compositional visual recognition.
Findings
Baseline models show varying success in factor recovery.
KamonBench enables detailed analysis beyond caption accuracy.
Controlled experiments reveal model sensitivities to specific factors.
Abstract
Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-to-structure benchmark with 20,000 synthetic composite crests and auxiliary component examples. Each composite crest is paired with a description in a formal kamon description language ("kamon yōgo"), a segmented Japanese analysis, an English translation, and a non-linguistic program code. Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed container-modifier contexts, and…
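To make the idea of factor-level evaluation concrete, here is a minimal Python sketch of how per-factor accuracy differs from exact-match accuracy. It assumes each crest's program code can be read as a (container, modifier, motif) triple. The `CrestFactors` type, the `factor_accuracy` function, and the example factor values are all hypothetical illustrations, not the benchmark's actual code format or metric implementation.

```python
from typing import NamedTuple


class CrestFactors(NamedTuple):
    """Hypothetical stand-in for a crest's program code: one value per factor."""
    container: str
    modifier: str
    motif: str


def factor_accuracy(predictions, references):
    """Return per-factor accuracy plus exact-match accuracy over paired lists."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and equal in length")
    n = len(references)
    scores = {}
    for field in CrestFactors._fields:
        # Count examples where this single factor was recovered correctly.
        hits = sum(getattr(p, field) == getattr(r, field)
                   for p, r in zip(predictions, references))
        scores[field] = hits / n
    # Exact match requires all three factors to be correct at once.
    scores["exact_match"] = sum(p == r for p, r in zip(predictions, references)) / n
    return scores


# Made-up factor values, purely for illustration.
references = [
    CrestFactors("circle", "shaded", "crane"),
    CrestFactors("circle", "plain", "oak"),
    CrestFactors("hexagon", "plain", "crane"),
    CrestFactors("hexagon", "shaded", "oak"),
]
predictions = [
    CrestFactors("circle", "shaded", "crane"),   # all factors correct
    CrestFactors("circle", "plain", "crane"),    # wrong motif
    CrestFactors("circle", "plain", "crane"),    # wrong container
    CrestFactors("hexagon", "shaded", "oak"),    # all factors correct
]

scores = factor_accuracy(predictions, references)
print(scores)
```

In this toy case exact match is 0.5, while the per-factor scores show that the modifier is always recovered and the errors are split between container and motif. That breakdown is the kind of detail a caption-level score alone would hide.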
