Mass-Producing Failures of Multimodal Systems with Language Models
Shengbang Tong, Erik Jones, Jacob Steinhardt

TL;DR
This paper introduces MultiMon, a system that automatically identifies systematic failures in multimodal systems. It collects cases of erroneous agreement (distinct inputs that produce the same output but should not) and uses a language model to describe the patterns behind them, revealing widespread failures in CLIP and in systems built on it.
Contribution
MultiMon is an automated approach that detects systematic failures in multimodal systems and uses large language models to describe them in natural language.
Findings
MultiMon identifies 14 systematic failure patterns in the CLIP text encoder, each comprising hundreds of distinct inputs.
Failures in CLIP propagate to systems like DALL-E and Midjourney.
MultiMon can target failures relevant to specific applications.
Abstract
Deployed multimodal systems can fail in ways that evaluators did not anticipate. In order to find these failures before deployment, we introduce MultiMon, a system that automatically identifies systematic failures -- generalizable, natural-language descriptions of patterns of model failures. To uncover systematic failures, MultiMon scrapes a corpus for examples of erroneous agreement: inputs that produce the same output, but should not. It then prompts a language model (e.g., GPT-4) to find systematic patterns of failure and describe them in natural language. We use MultiMon to find 14 systematic failures (e.g., "ignores quantifiers") of the CLIP text-encoder, each comprising hundreds of distinct inputs (e.g., "a shelf with a few/many books"). Because CLIP is the backbone for most state-of-the-art multimodal systems, these inputs produce failures in Midjourney 5.1, DALL-E, VideoFusion,…
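The two steps described in the abstract (collecting erroneous agreement, then prompting a language model) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: `toy_encoder` is a hypothetical stand-in for the CLIP text encoder that simply drops quantifiers, token-level Jaccard overlap is a crude stand-in for a reference judgment that two inputs differ, the thresholds are arbitrary, and the prompt wording in `build_prompt` is invented. No language model is actually called.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical quantifier list, used only to simulate an "ignores quantifiers" failure.
QUANTIFIERS = {"few", "many", "some", "several"}


def toy_encoder(text):
    """Stand-in for the encoder under test: bag of words with quantifiers dropped."""
    return Counter(w for w in text.lower().split() if w not in QUANTIFIERS)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Crude reference similarity (an assumption): token-set overlap of the raw texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def find_erroneous_agreement(corpus, encode, model_threshold=0.99, reference_threshold=0.9):
    """Return pairs the encoder treats as (nearly) identical although the reference says they differ."""
    flagged = []
    for a, b in combinations(corpus, 2):
        if cosine(encode(a), encode(b)) >= model_threshold and jaccard(a, b) < reference_threshold:
            flagged.append((a, b))
    return flagged


def build_prompt(pairs):
    """Assemble a prompt (hypothetical wording) asking a language model to describe failure patterns."""
    lines = [f'{i}. "{a}" vs. "{b}"' for i, (a, b) in enumerate(pairs, start=1)]
    return (
        "Each pair below receives nearly the same embedding even though the texts differ.\n"
        + "\n".join(lines)
        + "\nDescribe, in natural language, the general patterns of failure these pairs share."
    )


corpus = [
    "a shelf with few books",
    "a shelf with many books",
    "a dog on a beach",
    "a cat on a sofa",
]
pairs = find_erroneous_agreement(corpus, toy_encoder)
print(build_prompt(pairs))
```

In this toy corpus, only the few/many pair is flagged, since the stand-in encoder maps both captions to the same vector. In a real pipeline the encoder would be the model under test, the reference would be a stronger semantic similarity measure, and the prompt would be sent to a language model such as GPT-4.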
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Software Engineering Techniques and Practices
Methods: Contrastive Language-Image Pre-training
