Shape vs. Context: Examining Human-AI Gaps in Ambiguous Japanese Character Recognition
Daichi Haraguchi

TL;DR
This study compares human and Vision-Language Model decision patterns in ambiguous Japanese character recognition, revealing differences and conditions that improve alignment, thus providing insights for benchmarking human-AI decision alignment.
Contribution
It introduces a method to compare human and VLM decision boundaries using interpolated Japanese characters and explores how context affects their alignment.
Findings
Humans and VLMs differ in shape-only recognition boundaries.
Context can improve VLM-human alignment in some cases.
Behavioral differences highlight the need for better alignment benchmarks.
Abstract
High text recognition performance does not guarantee that Vision-Language Models (VLMs) share human-like decision patterns when resolving ambiguity. We investigate this behavioral gap by directly comparing humans and VLMs using continuously interpolated Japanese character shapes generated via a β-VAE. We estimate decision boundaries in a single-character recognition task (shape-only) and evaluate whether VLM responses align with human judgments under shape in context (i.e., embedding an ambiguous character near the human decision boundary in word-level context). We find that human and VLM decision boundaries differ in the shape-only task, and that shape in context can improve alignment with human judgments in some conditions. These results highlight qualitative behavioral differences, offering foundational insights toward human-VLM alignment benchmarking.
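As a rough illustration of what comparing decision boundaries along an interpolation axis can look like, the sketch below locates the interpolation ratio at which responses switch from one character to the other. This summary does not describe the paper's actual estimation procedure, so everything here is an assumption: the 50% threshold, the linear-interpolation crossing rule, the function name `decision_boundary`, and the response rates, which are invented purely for illustration.

```python
def decision_boundary(alphas, p_b, threshold=0.5):
    """Estimate where the response rate for character B crosses `threshold`.

    alphas: interpolation ratios between character A (0.0) and B (1.0), ascending.
    p_b:    fraction of responses naming character B at each ratio.
    Returns the first upward crossing, linearly interpolated between the two
    neighbouring sample points, or None if no upward crossing exists.
    """
    points = list(zip(alphas, p_b))
    for (a0, p0), (a1, p1) in zip(points, points[1:]):
        if p0 < threshold <= p1:
            return a0 + (threshold - p0) * (a1 - a0) / (p1 - p0)
    return None


# Hypothetical response rates (NOT data from the paper).
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
human_p_b = [0.0, 0.1, 0.4, 0.9, 1.0]
vlm_p_b = [0.0, 0.0, 0.2, 0.4, 1.0]

human_boundary = decision_boundary(alphas, human_p_b)
vlm_boundary = decision_boundary(alphas, vlm_p_b)

# A larger gap means the two observers switch labels at more distant shapes.
boundary_gap = abs(human_boundary - vlm_boundary)
print(f"human={human_boundary:.3f} vlm={vlm_boundary:.3f} gap={boundary_gap:.3f}")
```

Under this toy setup, a shape-in-context probe would place a character near `human_boundary` inside a word and check whether the VLM's answer moves toward the human judgment.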
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Text Readability and Simplification
