Multimodal LLMs Struggle with Basic Visual Network Analysis: a VNA Benchmark
Evan M. Williams, Kathleen M. Carley

TL;DR
This paper introduces a benchmark that evaluates how well vision-language models such as GPT-4 and LLaVa perform basic visual network analysis tasks, and finds that both models struggle on every task.
Contribution
It presents the first benchmark for assessing VLMs on fundamental visual network analysis tasks, highlighting their current limitations in this domain.
Findings
GPT-4 outperforms LLaVa in all tasks
Both models struggle with basic VNA tasks
Benchmark is publicly available for future research
Abstract
We evaluate the zero-shot ability of GPT-4 and LLaVa to perform simple Visual Network Analysis (VNA) tasks on small-scale graphs. We evaluate the Vision Language Models (VLMs) on five tasks related to three foundational network science concepts: identifying nodes of maximal degree on a rendered graph, identifying whether signed triads are balanced or unbalanced, and counting components. The tasks are structured to be easy for a human who understands the underlying graph-theoretic concepts, and all of them can be solved by counting the appropriate elements in the graphs. We find that while GPT-4 consistently outperforms LLaVa, both models struggle with every visual network analysis task we propose. We publicly release the first benchmark for the evaluation of VLMs on foundational VNA tasks.
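To make the three concepts concrete, here is a minimal sketch of how the ground-truth answers could be computed by counting, assuming each graph is given as a node list and an undirected edge list. The function names and the toy graph are hypothetical illustrations, not taken from the released benchmark, and the triad check uses the standard structural-balance rule (a triad is balanced when the product of its edge signs is positive).

```python
from collections import defaultdict


def max_degree_nodes(edges):
    """Return (sorted nodes of maximal degree, that degree) for an undirected edge list."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    top = max(degree.values())
    return sorted(n for n, d in degree.items() if d == top), top


def is_balanced_triad(signs):
    """A signed triad is balanced when the product of its three edge signs (+1/-1) is positive."""
    a, b, c = signs
    return a * b * c > 0


def count_components(nodes, edges):
    """Count connected components with a small union-find structure."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})


# Hypothetical toy graph: a star on nodes 0-3 plus a separate edge 4-5.
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (0, 2), (0, 3), (4, 5)]

top_nodes, top_degree = max_degree_nodes(edges)
n_components = count_components(nodes, edges)
```

On this toy graph, node 0 is the only node of maximal degree and there are two components. Each answer comes from a simple count, which is what makes the tasks easy for a human reader of the rendered graph.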
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Translation Studies and Practices
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding
