Multimodal LLMs Struggle with Basic Visual Network Analysis: a VNA Benchmark

Evan M. Williams; Kathleen M. Carley

arXiv:2405.06634 · cs.CV · June 11, 2024

TL;DR

This paper introduces a benchmark that evaluates the zero-shot ability of vision-language models (GPT-4 and LLaVa) to perform basic visual network analysis tasks on small graphs. Both models struggle on every task, even though each task can be solved by counting elements in the rendered graph.

Contribution

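The paper publicly releases the first benchmark for assessing VLMs on foundational visual network analysis tasks (identifying nodes of maximal degree, judging whether signed triads are balanced, and counting components) and documents how current models fall short on them.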

Findings

01 GPT-4 outperforms LLaVa in all tasks

02 Both models struggle with basic VNA tasks

03 Benchmark is publicly available for future research

Abstract

We evaluate the zero-shot ability of GPT-4 and LLaVa to perform simple Visual Network Analysis (VNA) tasks on small-scale graphs. We evaluate the Vision Language Models (VLMs) on five tasks related to three foundational network science concepts: identifying nodes of maximal degree on a rendered graph, identifying whether signed triads are balanced or unbalanced, and counting components. The tasks are structured to be easy for a human who understands the underlying graph-theoretic concepts, and can all be solved by counting the appropriate elements in graphs. We find that while GPT-4 consistently outperforms LLaVa, both models struggle with every visual network analysis task we propose. We publicly release the first benchmark for the evaluation of VLMs on foundational VNA tasks.
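To make the three concepts concrete, here is a minimal sketch of how each answer can be computed by counting once the graph is available as an edge list. It is an illustration only and is not taken from the benchmark repository: the function names and the toy graph are hypothetical, and the triad check uses the standard structural-balance rule that a signed triad is balanced when it has an even number of negative edges.

```python
# Illustrative sketch only; names and data are hypothetical, not the benchmark's code.

def max_degree_nodes(nodes, edges):
    """Return all nodes tied for the highest degree in an undirected graph."""
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    top = max(degree.values())
    return sorted(n for n, d in degree.items() if d == top)


def is_balanced_triad(signs):
    """A signed triad is balanced when it has an even number of negative edges."""
    negatives = sum(1 for s in signs if s < 0)
    return negatives % 2 == 0


def count_components(nodes, edges):
    """Count connected components with a small union-find structure."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})


# Toy graph: a star centred on node 0, a separate edge 4-5, and isolated node 6.
nodes = list(range(7))
edges = [(0, 1), (0, 2), (0, 3), (4, 5)]

print(max_degree_nodes(nodes, edges))   # [0]
print(is_balanced_triad([1, -1, -1]))   # True  (two negative edges)
print(is_balanced_triad([1, 1, -1]))    # False (one negative edge)
print(count_components(nodes, edges))   # 3
```

The point of the sketch is that every answer reduces to a count over nodes or edges, which is why the tasks are easy for a person who knows the definitions; the benchmark asks the models to recover the same answers from a rendered image rather than from an edge list.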

Code & Models

Repositories

evanup/vna_benchmark (Official)

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Translation Studies and Practices

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding