HEMM: Holistic Evaluation of Multimodal Foundation Models
Paul Pu Liang, Akshay Goindani, Talha Chafekar, Leena Mathur, Haofei Yu, Ruslan Salakhutdinov, Louis-Philippe Morency

TL;DR
This paper introduces HEMM, a comprehensive framework for evaluating multimodal foundation models across skills, information flow, and real-world applications, providing insights into their capabilities and challenges.
Contribution
It presents a systematic evaluation method for multimodal foundation models that covers basic skills, information flow, and real-world use cases, and it analyzes how various modeling choices affect performance.
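As a rough illustration of evaluating along several dimensions at once, the sketch below tags each dataset with one category per dimension and averages scores within each category. It is not the HEMM implementation. The record layout, the function name `mean_score_by`, the dataset names, the placeholder categories, and every score are hypothetical. Only the skill names and the "querying" flow type come from the abstract.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: one row per dataset, tagged along three dimensions.
# Names and scores are placeholders for illustration, not results from the paper.
RESULTS = [
    {"dataset": "dataset_a", "skill": "external_knowledge",
     "flow": "querying", "use_case": "use_case_1", "score": 0.5},
    {"dataset": "dataset_b", "skill": "external_knowledge",
     "flow": "other_flow", "use_case": "use_case_2", "score": 0.25},
    {"dataset": "dataset_c", "skill": "multi_step_reasoning",
     "flow": "querying", "use_case": "use_case_1", "score": 0.75},
]


def mean_score_by(results, dimension):
    """Return the average score for each category of the given dimension."""
    buckets = defaultdict(list)
    for row in results:
        buckets[row[dimension]].append(row["score"])
    return {category: mean(scores) for category, scores in buckets.items()}


if __name__ == "__main__":
    for dim in ("skill", "flow", "use_case"):
        print(dim, mean_score_by(RESULTS, dim))
```

Grouping the same per-dataset scores by different tags is one simple way to see which categories of a dimension are harder for a model. A real analysis would also need comparable metrics across datasets.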
Findings
Identifies key dataset dimensions challenging current models
Shows that scale and data significantly influence performance
Highlights importance of instruction tuning for multimodal tasks
Abstract
Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) to systematically evaluate the capabilities of multimodal foundation models across a set of 3 dimensions: basic skills, information flow, and real-world use cases. Basic multimodal skills are internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge. Information flow studies how multimodal content changes during a task through querying,…
Taxonomy
Topics: BIM and Construction Integration
Methods: Sparse Evolutionary Training
