Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering

Bowen Jiang; Zhijun Zhuang; Shreyas S. Shivakumar; Dan Roth; Camillo J. Taylor

arXiv:2403.14783 · cs.CV · March 25, 2024 · 1 citation

TL;DR

This paper introduces a multi-agent system for zero-shot visual question answering that leverages foundation models with specialized agents, aiming to improve robustness and practicality without dataset fine-tuning.

Contribution

It proposes a novel adaptive multi-agent framework for zero-shot VQA, addressing limitations of foundation models in object detection and counting.

Findings

01  Preliminary results demonstrate potential in zero-shot scenarios

02  System shows robustness without dataset fine-tuning

03  Identifies failure cases to guide future research

Abstract

This work explores the zero-shot capabilities of foundation models in Visual Question Answering (VQA) tasks. We propose an adaptive multi-agent system, named Multi-Agent VQA, to overcome the limitations of foundation models in object detection and counting by using specialized agents as tools. Unlike existing approaches, our study focuses on the system's performance without fine-tuning it on specific VQA datasets, making it more practical and robust in the open world. We present preliminary experimental results under zero-shot scenarios and highlight some failure cases, offering new directions for future research.
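The idea of a foundation model that falls back on specialized agents as tools can be illustrated with a minimal sketch. This is not the authors' implementation: the routing heuristic, the function names (`multi_agent_vqa`, `is_counting_question`), the agent signatures, and the stub agents are all hypothetical, chosen only to show one plausible control flow in which counting questions go to a counting agent and an object detector is consulted when the vision-language model abstains.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class VQAResult:
    answer: str
    route: str  # which agent(s) produced the answer


def is_counting_question(question: str) -> bool:
    # Hypothetical heuristic; the abstract does not say how questions are routed.
    return question.strip().lower().startswith("how many")


def multi_agent_vqa(
    image: Any,
    question: str,
    vlm: Callable[[Any, str, Optional[List[str]]], Optional[str]],
    detector: Callable[[Any, str], List[str]],
    counter: Callable[[Any, str], int],
) -> VQAResult:
    # Assumption: counting questions go straight to the counting specialist.
    if is_counting_question(question):
        return VQAResult(str(counter(image, question)), "counter")

    # Otherwise let the foundation model try on its own first.
    first_try = vlm(image, question, None)
    if first_try is not None:
        return VQAResult(first_try, "vlm")

    # Assumption: the model returns None to abstain; then the detector is
    # called and the model retries with the detected labels as context.
    detections = detector(image, question)
    second_try = vlm(image, question, detections)
    return VQAResult(second_try or "unknown", "vlm+detector")


# Stub agents standing in for real models, for illustration only.
def stub_vlm(image, question, context=None):
    if context is None:
        return None if "behind" in question else "a dog"
    return "a " + context[0]


def stub_detector(image, question):
    return ["bicycle", "bench"]


def stub_counter(image, question):
    return 3
```

With the stubs above, `multi_agent_vqa(None, "What is behind the tree?", stub_vlm, stub_detector, stub_counter)` takes the detector route, while a "How many ..." question is handled by the counter alone. Passing the agents in as plain callables keeps the orchestration independent of any particular model, which fits the zero-shot setting where no component is fine-tuned on a VQA dataset.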

Code & Models

Repositories

bowen-upenn/Multi-Agent-VQA (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques