iWISDM: Assessing instruction following in multimodal models at scale
Xiaoxuan Lei, Lucas Gomez, Hao Yuan Bai, Pouya Bashivan

TL;DR
This paper introduces iWISDM, a comprehensive benchmark environment for evaluating multimodal models' ability to follow complex visual and language instructions, revealing significant gaps compared to human performance.
Contribution
The paper presents iWISDM, a new scalable environment and benchmark suite for assessing instruction-following in multimodal models across diverse, complex vision-language tasks.
Findings
iWISDM effectively evaluates instruction-following in multimodal models
Current models lag behind human performance in instruction adherence
The benchmark reveals significant gaps in multimodal model capabilities
Abstract
The ability to perform complex tasks from detailed instructions is a key to many remarkable achievements of our species. As humans, we are not only capable of performing a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs (either text or vision), narrowing the scope of multimodal assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct…
Taxonomy
Topics: EFL/ESL Teaching and Learning · Second Language Acquisition and Learning · Second Language Learning and Teaching
