Can Visual Encoder Learn to See Arrows?

Naoyuki Terashita; Yusuke Tozaki; Hideaki Omote; Congkha Nguyen; Ryosuke Nakamoto; Yuta Koreeda; Hiroaki Ozaki

arXiv:2505.19944 · cs.CV · May 27, 2025

TL;DR

This paper shows that training a vision language model on diagrams free of textual and positional bias lets the image encoder learn explicit edge features, improving diagram understanding over pretrained CLIP.

Contribution

It trains an image encoder with contrastive learning on artificially generated, bias-free diagram-caption pairs and shows that the encoder can learn to recognize edges (lines and arrows), which improves diagram comprehension.

Findings

01. Fine-tuned models outperform pretrained CLIP on diagram tasks.

02. The approach surpasses zero-shot GPT-4o and LLaVA-Mistral in captioning.

03. Eliminating textual and positional biases improves edge recognition.
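
To make the idea of removing textual and positional bias concrete, here is a minimal sketch of how one synthetic diagram-caption sample could be drawn. It is an illustration under stated assumptions, not the paper's generator: the function names, the label format, and the `A -> B` caption style are all hypothetical. Node labels are random strings (so the text gives no hint about which nodes connect), node positions are uniform (so layout gives no hint either), and every ordered pair of nodes is equally likely to become an arrow.

```python
import random
import string


def random_label(rng, length=3):
    """Random uppercase string, so a label carries no semantic hint about edges."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))


def make_sample(rng, n_nodes=4, n_edges=3):
    """Return (layout, edges, caption) for one synthetic diagram (hypothetical format)."""
    labels = set()
    while len(labels) < n_nodes:
        labels.add(random_label(rng))
    labels = sorted(labels)
    # Positions are uniform in the unit square and drawn independently of the edges.
    layout = {name: (rng.random(), rng.random()) for name in labels}
    # Every ordered pair of distinct nodes is equally likely to become an arrow.
    candidates = [(a, b) for a in labels for b in labels if a != b]
    edges = rng.sample(candidates, n_edges)
    caption = ", ".join(f"{src} -> {dst}" for src, dst in edges)
    return layout, edges, caption


rng = random.Random(0)
layout, edges, caption = make_sample(rng)
print(caption)
```

Because the caption can only be matched to the image by actually reading the arrows, a model trained on such pairs cannot lean on label semantics or typical layouts as a shortcut.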

Abstract

The diagram is a visual representation of a relationship illustrated with edges (lines or arrows), which is widely used in industrial and scientific communication. Although recognizing diagrams is essential for vision language models (VLMs) to comprehend domain-specific knowledge, recent studies reveal that many VLMs fail to identify edges in images. We hypothesize that these failures stem from an over-reliance on textual and positional biases, preventing VLMs from learning explicit edge features. Based on this idea, we empirically investigate whether the image encoder in VLMs can learn edge representation through training on a diagram dataset in which edges are biased neither by textual nor positional information. To this end, we conduct contrastive learning on an artificially generated diagram-caption dataset to train an image encoder and evaluate its diagram-related features on…

Taxonomy

Topics: Hand Gesture Recognition Systems · Robot Manipulation and Learning · Manufacturing Process and Optimization

Methods: Contrastive Language-Image Pre-training · Contrastive Learning