Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models
Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito, Keisuke Sakaguchi, Kentaro Inui

TL;DR
This paper investigates why large vision-language models (LVLMs) struggle to understand relationships in diagrams. Probing shows that edge information becomes linearly encoded later in the processing pipeline than node information.
Contribution
The study identifies the stage at which different diagram elements become linearly encoded in LVLMs, showing that edge information is encoded later than node information.
Findings
Edge information is not linearly separable in the vision encoder.
Node information and global structural features are already linearly encoded in individual hidden states of the vision encoder.
Edge information becomes linearly encoded only in the text tokens of the language model.
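
To make "linearly separable" concrete, here is a minimal sketch of a linear probe: a logistic-regression classifier trained on fixed feature vectors. This is not the authors' code or data. The 2-D vectors are a hypothetical stand-in for hidden states, and the two labels are toy assumptions: one depends on a single coordinate (readable by a linear probe), the other is an XOR of two coordinates (present in the features, but not linearly separable).

```python
import math
import random

def train_linear_probe(X, y, epochs=200, lr=0.5):
    """Fit logistic regression (a linear probe) with batch gradient descent."""
    dim = len(X[0])
    w, b, n = [0.0] * dim, 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - t  # predicted prob minus target
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(w, b, X, y):
    """Fraction of examples on the correct side of the learned hyperplane."""
    hits = 0
    for x, t in zip(X, y):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == (t == 1))
    return hits / len(X)

def make_toy_states(n, rng):
    """Hypothetical 'hidden states': random 2-D points with two toy labels."""
    X, y_linear, y_xor = [], [], []
    for _ in range(n):
        a, c = rng.uniform(-1, 1), rng.uniform(-1, 1)
        X.append([a, c])
        y_linear.append(1 if a > 0 else 0)             # linearly separable
        y_xor.append(1 if (a > 0) != (c > 0) else 0)   # XOR: not separable
    return X, y_linear, y_xor

rng = random.Random(0)
X_train, yl_train, yx_train = make_toy_states(400, rng)
X_test, yl_test, yx_test = make_toy_states(200, rng)

w, b = train_linear_probe(X_train, yl_train)
acc_linear = accuracy(w, b, X_test, yl_test)
w, b = train_linear_probe(X_train, yx_train)
acc_xor = accuracy(w, b, X_test, yx_test)
print(f"linear label: {acc_linear:.2f}, XOR label: {acc_xor:.2f}")
```

In this toy setting, high probe accuracy means the label is linearly encoded in the features, while near-chance accuracy means it is not linearly encoded, even though the information is present in a nonlinear form.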
Abstract
Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representation of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens in the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual…
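
As a rough illustration of what a synthetic dataset based on directed graphs might look like, the sketch below samples small random directed graphs and derives node-level, edge-level, and global labels that a probe could be trained on. Everything here is an assumption for illustration: the function names, graph sizes, edge probability, and label definitions are hypothetical and are not taken from the paper, and rendering each graph to a diagram image is omitted.

```python
import random
from itertools import permutations

def make_graph_sample(rng, max_nodes=5, edge_prob=0.3):
    """Sample a small directed graph with letter-named nodes (hypothetical setup)."""
    n = rng.randint(2, max_nodes)
    nodes = [chr(ord("A") + i) for i in range(n)]
    # Each ordered pair (u, v) with u != v becomes a directed edge with prob edge_prob.
    edges = [(u, v) for u, v in permutations(nodes, 2) if rng.random() < edge_prob]
    return {"nodes": nodes, "edges": edges}

def probing_labels(sample, src, dst):
    """Derive toy probing targets for one graph."""
    return {
        "has_node": src in sample["nodes"],          # node-level label
        "has_edge": (src, dst) in sample["edges"],   # edge-level label (directed)
        "num_edges": len(sample["edges"]),           # global structural label
    }

rng = random.Random(0)
dataset = [make_graph_sample(rng) for _ in range(100)]
labels = [probing_labels(s, "A", "B") for s in dataset]
print(dataset[0], labels[0])
```

Because edges are ordered pairs, an edge from A to B and an edge from B to A are distinct labels, which is what makes direction (the arrowhead in a rendered diagram) part of the probing target.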
Taxonomy
Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Advanced Graph Neural Networks
