ShapeWorld - A new test methodology for multimodal language understanding
Alexander Kuhnle, Ann Copestake

TL;DR
ShapeWorld presents a new framework for evaluating multimodal language understanding by automatically generating controlled artificial data, enabling detailed assessment of models' generalization and reasoning abilities.
Contribution
The paper introduces a controllable data generation methodology for testing multimodal models, with an emphasis on assessing generalization capabilities beyond what existing benchmarks test.
Findings
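The evaluated visual question answering models show varying generalization abilities across the four tasks
The framework gives detailed insights into the models' capabilities and limitations
The framework is open-sourced in the hope of stimulating further research in multimodal language understanding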
Abstract
We introduce a novel framework for evaluating multimodal deep learning models with respect to their language understanding and generalization abilities. In this approach, artificial data is automatically generated according to the experimenter's specifications. The content of the data, both during training and evaluation, can be controlled in detail, which enables tasks to be created that require true generalization abilities, in particular the combination of previously introduced concepts in novel ways. We demonstrate the potential of our methodology by evaluating various visual question answering models on four different tasks, and show how our framework gives us detailed insights into their capabilities and limitations. By open-sourcing our framework, we hope to stimulate progress in the field of multimodal language understanding.
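The idea of controlling data content so that evaluation requires novel combinations of known concepts can be illustrated with a small sketch. This is not ShapeWorld's actual API or data format: the names `SHAPES`, `COLORS`, `HELD_OUT`, `generate_example`, and `generate_dataset` are hypothetical, and the "world" here is just a list of (shape, color) pairs rather than an image. The assumption is a caption-agreement style task in which some attribute combinations are withheld from training and queried only at test time.

```python
import random

# Toy vocabulary; all names and values here are illustrative assumptions.
SHAPES = ["square", "circle", "triangle"]
COLORS = ["red", "green", "blue"]
ALL_PAIRS = [(s, c) for s in SHAPES for c in COLORS]

# Hypothetical held-out combinations: never shown or mentioned during training.
HELD_OUT = {("square", "red"), ("circle", "blue")}
TRAIN_PAIRS = [p for p in ALL_PAIRS if p not in HELD_OUT]


def generate_example(split, rng, num_objects=3):
    """Sample a toy world, a caption, and whether the caption agrees with the world."""
    if split == "train":
        # Training worlds and captions only use the permitted combinations.
        world_pairs, caption_pairs = TRAIN_PAIRS, TRAIN_PAIRS
    elif split == "test":
        # Test captions ask about combinations that were never seen in training.
        world_pairs, caption_pairs = ALL_PAIRS, sorted(HELD_OUT)
    else:
        raise ValueError(f"unknown split: {split}")

    world = [rng.choice(world_pairs) for _ in range(num_objects)]
    target = rng.choice(caption_pairs)
    shape, color = target
    return {
        "world": world,
        "target": target,
        "caption": f"There is a {color} {shape}.",
        "agreement": target in world,
    }


def generate_dataset(split, size, seed=0):
    """Generate a reproducible list of examples for the given split."""
    rng = random.Random(seed)
    return [generate_example(split, rng) for _ in range(size)]


if __name__ == "__main__":
    for example in generate_dataset("test", 3, seed=1):
        print(example["caption"], example["agreement"], example["world"])
```

Because each shape and each color still appears in training (only specific pairings are withheld), a model that does well on the test split must recombine familiar concepts rather than recall seen examples.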
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems
