How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges

Haotong Qin; Ge-Peng Ji; Salman Khan; Deng-Ping Fan; Fahad Shahbaz Khan; Luc Van Gool

arXiv:2307.15016·cs.CV·August 31, 2023·2 cites

Open Access · 1 Repo

TL;DR

This paper empirically evaluates how well Google's Bard understands and interprets visual data across diverse, challenging scenarios, revealing significant gaps in its current multi-modal visual understanding capabilities.

Contribution

It provides a comprehensive assessment of Bard's visual understanding performance across 15 diverse tasks, highlighting current limitations and future challenges for multi-modal AI models.

Findings

01 Bard struggles with complex visual tasks

02 Significant gap in vision-based understanding remains

03 Evaluation across diverse data types highlights challenges

Abstract

Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record in handling textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned by text questions. This exploration holds the potential to unveil new insights and challenges for Bard and other forthcoming multi-modal generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, underwater, and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding indicates that Bard still struggles…

Code & Models

Repositories

htqin/googlebard-visunderstand (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Subtitles and Audiovisual Media

Methods: Focus