VSA4VQA: Scaling a Vector Symbolic Architecture to Visual Question Answering on Natural Images
Anna Penzkofer, Lei Shi, Andreas Bulling

TL;DR
VSA4VQA introduces a novel 4D vector symbolic architecture for natural image representation and visual question answering, enabling complex spatial queries and achieving competitive zero-shot performance on the GQA dataset.
Contribution
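The paper extends Vector Symbolic Architectures to natural images and complex spatial queries in VQA by adding width and height dimensions to the object encoding, introducing learned spatial query masks, and integrating a pre-trained vision-language model for attribute questions.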
Findings
Encodes natural images effectively, including each object's spatial location, width, and height.
Achieves competitive zero-shot VQA performance on the GQA dataset.
Is the first VSA-based model to scale to complex spatial queries.
Abstract
While Vector Symbolic Architectures (VSAs) are promising for modelling spatial cognition, their application is currently limited to artificially generated images and simple spatial queries. We propose VSA4VQA, a novel 4D implementation of VSAs that implements a mental representation of natural images for the challenging task of Visual Question Answering (VQA). VSA4VQA is the first model to scale a VSA to complex spatial queries. Our method is based on the Semantic Pointer Architecture (SPA) to encode objects in a hyperdimensional vector space. To encode natural images, we extend the SPA to include dimensions for objects' width and height in addition to their spatial location. To perform spatial queries, we further introduce learned spatial query masks and integrate a pre-trained vision-language model for answering attribute-related questions. We evaluate our method on the GQA benchmark…
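The 4D encoding described in the abstract can be illustrated with a small sketch. It assumes the common construction for spatial semantic pointers: binding is circular convolution, and a continuous coordinate is encoded by raising a unitary axis vector to a fractional power in the Fourier domain. The dimensionality, the axis names `X`, `Y`, `W`, `H`, the object vectors, and the helper names are all hypothetical choices for illustration, not the authors' implementation. The learned spatial query masks and the vision-language model are not shown.

```python
import numpy as np

D = 1024  # vector dimensionality (assumed for this sketch)
rng = np.random.default_rng(0)

def unitary(d, rng):
    """Random real vector whose Fourier coefficients all have magnitude 1."""
    phases = rng.uniform(-np.pi, np.pi, d // 2 + 1)
    phases[0] = 0.0          # DC coefficient must be real
    if d % 2 == 0:
        phases[-1] = 0.0     # Nyquist coefficient must be real
    return np.fft.irfft(np.exp(1j * phases), n=d)

def bind(a, b):
    """Circular convolution, computed in the Fourier domain."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def power(v, exponent):
    """Fractional binding: v bound with itself `exponent` times."""
    return np.fft.irfft(np.fft.rfft(v) ** exponent, n=len(v))

def inverse(v):
    """Involution; the exact binding inverse for unitary vectors."""
    return np.concatenate(([v[0]], v[:0:-1]))

# One unitary axis vector per encoded quantity (hypothetical names).
X, Y, W, H = (unitary(D, rng) for _ in range(4))

def encode_box(x, y, w, h):
    """4D pointer for a box at (x, y) with width w and height h."""
    return bind(bind(power(X, x), power(Y, y)),
                bind(power(W, w), power(H, h)))

# Hypothetical object vectors and box coordinates.
DOG, BALL = unitary(D, rng), unitary(D, rng)
dog_box = encode_box(2.0, 3.0, 1.5, 1.0)
ball_box = encode_box(-3.0, 1.0, 0.5, 2.0)

# Image memory: superposition of object-bound-to-box terms.
memory = bind(DOG, dog_box) + bind(BALL, ball_box)

# Query "where is the dog?": unbind the object, compare with candidate boxes.
recovered = bind(memory, inverse(DOG))
sim_true = float(recovered @ dog_box)
sim_other = float(recovered @ ball_box)
print(f"similarity to dog box:  {sim_true:.2f}")
print(f"similarity to ball box: {sim_other:.2f}")
```

In this sketch the recovered vector is much closer to the dog's box encoding than to the ball's, because unbinding with a unitary vector is exact and the remaining term behaves like noise.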
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
