IWISDM: Assessing instruction following in multimodal models at scale

Xiaoxuan Lei; Lucas Gomez; Hao Yuan Bai; Pouya Bashivan

arXiv:2406.14343 · cs.AI · July 23, 2024

TL;DR

This paper introduces iWISDM, a comprehensive benchmark environment for evaluating multimodal models' ability to follow complex visual and language instructions, revealing significant gaps compared to human performance.

Contribution

The paper presents iWISDM, a new scalable environment and benchmark suite for assessing instruction-following in multimodal models across diverse, complex vision-language tasks.

Findings

01. iWISDM effectively evaluates instruction-following in multimodal models.

02. Current models lag behind human performance in instruction adherence.

03. The benchmark reveals significant gaps in multimodal model capabilities.

Abstract

The ability to perform complex tasks from detailed instructions is a key to many remarkable achievements of our species. As humans, we are not only capable of performing a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs (either text or vision), narrowing the scope of multimodal assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct…

Code & Models

Repositories

bashivanlab/iwisdm (official)

Taxonomy

Topics: EFL/ESL Teaching and Learning · Second Language Acquisition and Learning · Second Language Learning and Teaching