Can Visual Encoder Learn to See Arrows?
Naoyuki Terashita, Yusuke Tozaki, Hideaki Omote, Congkha Nguyen, Ryosuke Nakamoto, Yuta Koreeda, Hiroaki Ozaki

TL;DR
This paper shows that contrastive training on a diagram dataset free of textual and positional biases lets a VLM's image encoder learn explicit edge features, improving diagram understanding over pretrained CLIP and, in captioning, over zero-shot GPT-4o and LLaVA-Mistral.
Contribution
It introduces a contrastive training approach on bias-free diagrams, showing that VLMs can learn to recognize edges, which enhances their diagram comprehension capabilities.
Findings
Fine-tuned models outperform pretrained CLIP on diagram tasks.
The approach surpasses zero-shot GPT-4o and LLaVA-Mistral in captioning.
Eliminating textual and positional biases improves edge recognition.
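One way to picture a sample with no textual or positional bias is sketched below. This is a hypothetical illustration, not the authors' generator: the function name `make_diagram_sample`, the single-letter labels, the canvas size, and the `A -> B` caption format are all assumptions made for this example. The idea is that labels, positions, and edges are each drawn independently at random, so neither the label text nor the layout predicts which nodes are connected.

```python
import random
import string


def make_diagram_sample(num_nodes=4, num_edges=3, canvas=224, seed=None):
    """Build one synthetic diagram description (hypothetical format)."""
    rng = random.Random(seed)
    # Random single-letter labels: the label text gives no hint about connectivity.
    labels = rng.sample(string.ascii_uppercase, num_nodes)
    # Uniformly random positions: the layout gives no hint about connectivity either.
    positions = {lab: (rng.randrange(canvas), rng.randrange(canvas)) for lab in labels}
    # Directed edges drawn uniformly from all ordered pairs of distinct nodes.
    pairs = [(a, b) for a in labels for b in labels if a != b]
    edges = rng.sample(pairs, num_edges)
    # Assumed caption format: a comma-separated list of "source -> target" pairs.
    caption = ", ".join(f"{a} -> {b}" for a, b in edges)
    return {"positions": positions, "edges": edges, "caption": caption}


if __name__ == "__main__":
    sample = make_diagram_sample(seed=0)
    print(sample["caption"])
```

Rendering the nodes and arrows to an image is left out here; any drawing library could consume `positions` and `edges`.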
Abstract
The diagram is a visual representation of a relationship illustrated with edges (lines or arrows), which is widely used in industrial and scientific communication. Although recognizing diagrams is essential for vision language models (VLMs) to comprehend domain-specific knowledge, recent studies reveal that many VLMs fail to identify edges in images. We hypothesize that these failures stem from an over-reliance on textual and positional biases, preventing VLMs from learning explicit edge features. Based on this idea, we empirically investigate whether the image encoder in VLMs can learn edge representation through training on a diagram dataset in which edges are biased by neither textual nor positional information. To this end, we conduct contrastive learning on an artificially generated diagram-caption dataset to train an image encoder and evaluate its diagram-related features on…
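The contrastive learning mentioned in the abstract can be illustrated with a CLIP-style symmetric objective. The sketch below is a minimal NumPy version under stated assumptions: the function names and the temperature value of 0.07 are choices made for this example, and the abstract does not specify the authors' exact loss or hyperparameters. Row i of `image_emb` and row i of `text_emb` are assumed to come from the same diagram-caption pair.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/caption embeddings."""
    img = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    # Pairwise cosine similarities, sharpened by the (assumed) temperature.
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should rank its own caption highest.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each caption should rank its own image highest.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    print("matched pairs:   ", clip_contrastive_loss(emb, emb))
    print("mismatched pairs:", clip_contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

Matched pairs should give a much lower loss than mismatched ones, which is the signal that pushes the image encoder to encode whatever the caption describes, here the edges.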
Taxonomy
Topics: Hand Gesture Recognition Systems · Robot Manipulation and Learning · Manufacturing Process and Optimization
Methods: Contrastive Language-Image Pre-training · Contrastive Learning
