Visual Graph Question Answering with ASP and LLMs for Language Parsing
Jakob Johannes Bauer (ETH Zürich, Switzerland), Thomas Eiter (TU Wien, Austria), Nelson Higuera Ruiz (TU Wien, Austria), Johannes Oetsch (Jönköping University, Sweden)

TL;DR
This paper presents a modular neuro-symbolic system combining ASP, LLMs, and neural networks to address graph-based visual question answering, achieving 73% accuracy on a new dataset of transit network images.
Contribution
It introduces a novel neuro-symbolic approach integrating ASP, LLMs, and neural networks for graph VQA, with a new dataset and baseline performance.
Findings
Achieved 73% accuracy on the new graph VQA dataset.
Demonstrated the effectiveness of pretrained models without additional training.
Showcased the potential of neuro-symbolic systems for complex visual reasoning.
Abstract
Visual Question Answering (VQA) is a challenging problem that requires processing multimodal input. Answer-Set Programming (ASP) has shown great potential in this regard to add interpretability and explainability to modular VQA architectures. In this work, we address the problem of how to integrate ASP with modules for vision and natural language processing to solve a new and demanding VQA variant that is concerned with images of graphs (not graphs in symbolic form). Images containing graph-based structures are a ubiquitous and popular form of visualisation. Here, we deal with the particular problem of graphs inspired by transit networks, and we introduce a novel dataset that amends an existing one by adding images of graphs that resemble metro lines. Our modular neuro-symbolic approach combines optical graph recognition for graph parsing, a pretrained optical character recognition…
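To make the modular structure described in the abstract more concrete, the following is a minimal sketch of how a parsed graph and a parsed question might be combined to produce an answer. It is not the authors' implementation: the edge list stands in for the output of the vision modules, a keyword-based `parse_question` stands in for LLM language parsing, and a plain breadth-first search stands in for the ASP reasoning component. The station names, the question template, and all function names are hypothetical.

```python
from collections import deque

# Hypothetical output of the vision modules (graph parsing plus OCR):
# labeled edges as (station_a, station_b, line_name).
# Toy data for illustration only, not taken from the paper's dataset.
EDGES = [
    ("Alma", "Brook", "red"),
    ("Brook", "Cedar", "red"),
    ("Cedar", "Dover", "blue"),
    ("Brook", "Dover", "blue"),
]


def build_adjacency(edges):
    """Turn the labeled edge list into an undirected adjacency map."""
    adj = {}
    for a, b, _line in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def parse_question(question):
    """Stand-in for LLM language parsing (assumption: one fixed template).

    Handles only: "How many stations are between X and Y?"
    """
    words = question.rstrip("?").split()
    i = words.index("between")
    return {"op": "count_between", "src": words[i + 1], "dst": words[i + 3]}


def stations_between(adj, src, dst):
    """Stand-in for the symbolic reasoning step (BFS here, not ASP).

    Returns the number of intermediate stations on a shortest path,
    or None if dst is unreachable from src.
    """
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return max(dist[node] - 1, 0)
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


def answer(question, edges):
    """Chain the modules: parse the question, then reason over the graph."""
    query = parse_question(question)
    adj = build_adjacency(edges)
    if query["op"] == "count_between":
        return stations_between(adj, query["src"], query["dst"])
    raise ValueError(f"unsupported operation: {query['op']}")


if __name__ == "__main__":
    print(answer("How many stations are between Alma and Dover?", EDGES))
```

Keeping each stage behind its own function mirrors the modularity the abstract emphasises: in principle, any stage could be swapped for a stronger component without touching the others.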
