Mass-Producing Failures of Multimodal Systems with Language Models

Shengbang Tong; Erik Jones; Jacob Steinhardt

arXiv:2306.12105 · cs.LG · March 6, 2024 · 5 cites

TL;DR

This paper introduces MultiMon, a system that automatically identifies systematic failures in multimodal systems. It collects examples of erroneous agreement (inputs that produce the same output but should not) and prompts a language model to describe the patterns behind them, revealing widespread failures in CLIP and in systems built on it.

Contribution

MultiMon is an automated approach that detects systematic failures in multimodal systems and uses large language models to describe them in natural language.

Findings

01. Identified 14 systematic failure patterns in the CLIP text encoder.

02. Failures in CLIP propagate to systems like DALL-E and Midjourney.

03. MultiMon can target failures relevant to specific applications.

Abstract

Deployed multimodal systems can fail in ways that evaluators did not anticipate. In order to find these failures before deployment, we introduce MultiMon, a system that automatically identifies systematic failures -- generalizable, natural-language descriptions of patterns of model failures. To uncover systematic failures, MultiMon scrapes a corpus for examples of erroneous agreement: inputs that produce the same output, but should not. It then prompts a language model (e.g., GPT-4) to find systematic patterns of failure and describe them in natural language. We use MultiMon to find 14 systematic failures (e.g., "ignores quantifiers") of the CLIP text-encoder, each comprising hundreds of distinct inputs (e.g., "a shelf with a few/many books"). Because CLIP is the backbone for most state-of-the-art multimodal systems, these inputs produce failures in Midjourney 5.1, DALL-E, VideoFusion,…
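
The notion of erroneous agreement in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "same output" means a high cosine similarity between embeddings from the audited model, and that "should not" is judged by a low similarity under some reference embedding. The function names, the thresholds, and the hand-made toy vectors are all hypothetical and stand in for real encoder outputs.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def erroneous_agreements(texts, model_emb, ref_emb, agree=0.95, differ=0.8):
    """Return text pairs that the audited model embeds almost identically
    (similarity >= agree) while the reference says they differ
    (similarity <= differ). Both thresholds are illustrative assumptions."""
    flagged = []
    for i, j in combinations(range(len(texts)), 2):
        same_for_model = cosine(model_emb[i], model_emb[j]) >= agree
        different_for_ref = cosine(ref_emb[i], ref_emb[j]) <= differ
        if same_for_model and different_for_ref:
            flagged.append((texts[i], texts[j]))
    return flagged


# Toy corpus with hand-made vectors; these are NOT real encoder outputs.
texts = [
    "a shelf with a few books",
    "a shelf with many books",
    "a dog on a beach",
]
model_emb = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]  # audited model (hypothetical)
ref_emb = [[1.0, 0.0], [0.5, 0.8], [0.0, 1.0]]      # reference (hypothetical)

pairs = erroneous_agreements(texts, model_emb, ref_emb)
print(pairs)
```

In this toy setup only the "few/many books" pair is flagged. In the pipeline the abstract describes, pairs like this would then be given to a language model, which is asked to describe the systematic pattern they share. That step is omitted here.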

Code & Models

Repositories

tsb0601/multimon (PyTorch, official)

Videos

Mass-Producing Failures of Multimodal Systems with Language Models · SlidesLive

Taxonomy

Topics: Software Engineering Research · Software System Performance and Reliability · Software Engineering Techniques and Practices

Methods: Contrastive Language-Image Pre-training