WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models
Yonatan Bitton, Nitzan Bitton Guetta, Ron Yosef, Yuval Elovici, Mohit Bansal, Gabriel Stanovsky, Roy Schwartz

TL;DR
WinoGAViL is a gamified benchmark for vision-and-language models. Its online game collects associations that humans find intuitive but that expose the limits of current models' commonsense reasoning.
Contribution
The paper presents WinoGAViL, a novel gamified benchmark for evaluating vision-and-language models' association and reasoning skills, along with a dataset and interactive platform.
Findings
Humans find the associations intuitive (>90% Jaccard index).
The best model evaluated, ViLT, scores only 52%, succeeding mostly where the cue is visually salient.
Associations require diverse reasoning skills.
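The scores above are reported as a Jaccard index, which measures the overlap between the set of candidates a solver selects and the set the spymaster intended. A minimal sketch of that calculation follows, assuming each candidate image is identified by a string label. The function name, the labels, and the handling of two empty selections are illustrative assumptions, not taken from the paper's code.

```python
def jaccard_index(predicted, gold):
    """Return |predicted ∩ gold| / |predicted ∪ gold| for two collections of labels."""
    predicted, gold = set(predicted), set(gold)
    union = predicted | gold
    if not union:
        # Assumption: two empty selections are treated as a perfect match.
        return 1.0
    return len(predicted & gold) / len(union)


# Hypothetical instance for the cue "werewolf"; the image labels are made up.
gold = {"full_moon", "wolf"}        # candidates the spymaster associated with the cue
predicted = {"full_moon", "cat"}    # candidates a solver picked

score = jaccard_index(predicted, gold)
print(f"{score:.2f}")  # one shared label out of three distinct labels -> 0.33
```

A solver that picks exactly the intended candidates scores 1.0, and one that picks none of them scores 0.0, so partially correct selections earn partial credit.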
Abstract
While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes to basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game of vision-and-language associations (e.g., between werewolves and a full moon), used as a dynamic evaluation benchmark. Inspired by the popular card game Codenames, a spymaster gives a textual cue related to several visual candidates, and another player tries to identify them. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models, where the best model (ViLT) achieves a score of 52%, succeeding mostly where the cue is visually salient. Our…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
