Eureka: Evaluating and Understanding Large Foundation Models
Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, Safoora Yousefi

TL;DR
Eureka provides a standardized, open-source framework and benchmark collection for evaluating large foundation models across diverse capabilities, revealing nuanced strengths and weaknesses beyond simple rankings.
Contribution
The paper introduces Eureka, an extensible evaluation framework and benchmark suite that addresses current challenges in assessing large models' capabilities.
Findings
Different models excel in different capabilities.
Current models still struggle with image understanding and factual grounding.
No single model is best across all evaluated capabilities.
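To illustrate why a single aggregate ranking can hide the pattern described in these findings, here is a minimal Python sketch. It does not use the Eureka framework or its API. The model names, capability names, and scores are hypothetical placeholders made up for illustration, not results from the paper.

```python
from typing import Dict, List, Optional

# Hypothetical placeholder scores (NOT results from the paper).
# Outer key: model name, inner key: capability name, value: score in [0, 1].
Scores = Dict[str, Dict[str, float]]

scores: Scores = {
    "model_a": {"capability_x": 0.9, "capability_y": 0.4, "capability_z": 0.6},
    "model_b": {"capability_x": 0.6, "capability_y": 0.8, "capability_z": 0.5},
    "model_c": {"capability_x": 0.7, "capability_y": 0.6, "capability_z": 0.7},
}


def rank_by_mean(scores: Scores) -> List[str]:
    """Single-score view: order models by their mean score, best first."""
    def mean(model: str) -> float:
        values = scores[model].values()
        return sum(values) / len(values)
    return sorted(scores, key=mean, reverse=True)


def best_per_capability(scores: Scores) -> Dict[str, str]:
    """Disaggregated view: the top-scoring model for each capability."""
    capabilities = next(iter(scores.values())).keys()
    return {
        cap: max(scores, key=lambda model: scores[model][cap])
        for cap in capabilities
    }


def dominant_model(scores: Scores) -> Optional[str]:
    """Return the model that wins every capability, or None if there is none."""
    winners = set(best_per_capability(scores).values())
    return winners.pop() if len(winners) == 1 else None


if __name__ == "__main__":
    print("Ranking by mean score:", rank_by_mean(scores))
    print("Best model per capability:", best_per_capability(scores))
    print("Model that is best everywhere:", dominant_model(scores))
```

With these placeholder numbers, the mean-score ranking puts one model first even though each capability has a different winner and no model wins everywhere. That is the kind of detail a per-capability report keeps visible.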
Abstract
Rigorous and reproducible evaluation is critical for assessing the state of the art and for guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due to several reasons, including benchmark saturation, lack of transparency in methods used for measurement, development challenges in extracting measurements for generative tasks, and, more generally, the extensive number of capabilities required for a well-rounded comparison across models. We make three contributions to alleviate the above challenges. First, we present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings. Second, we introduce Eureka-Bench as an extensible collection of benchmarks testing capabilities that (i) are still challenging for state-of-the-art models and (ii) represent fundamental but overlooked…
