Automatic benchmarking of large multimodal models via iterative experiment programming
Alessandro Conti, Enrico Fini, Paolo Rota, Yiming Wang, Massimiliano Mancini, Elisa Ricci

TL;DR
APEx automates the benchmarking of large multimodal models by using a large language model to generate, execute, and refine experiments based on natural language research questions, reducing manual effort and increasing flexibility.
Contribution
This paper introduces APEx, the first framework for automatic benchmarking of large multimodal models using iterative experiment programming with LLMs.
Findings
APEx successfully reproduces existing study findings.
It enables arbitrary analyses and hypothesis testing.
The framework is modular and extensible.
Abstract
Assessing the capabilities of large multimodal models (LMMs) often requires the creation of ad-hoc evaluations. Currently, building new benchmarks requires tremendous amounts of manual work for each specific analysis. This makes the evaluation process tedious and costly. In this paper, we present APEx, Automatic Programming of Experiments, the first framework for automatic benchmarking of LMMs. Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand, and progressively compile a scientific report. The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions. Finally, the LLM refines the report, presenting the results to the…
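The loop described in the abstract (choose an experiment based on the current report, run it with a tool, update the report, stop when the results suffice) can be sketched as follows. This is a minimal illustration, not APEx's actual code: every name here (Report, run_investigation, toy_planner, toy_tools) is hypothetical, the planner is a plain deterministic function standing in for the LLM, and the tool outputs are made-up placeholder numbers rather than results from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# All names below are hypothetical and chosen for illustration only;
# they are not taken from the APEx codebase.


@dataclass
class Report:
    """Running record of the investigation that drives the next decision."""
    question: str
    findings: List[str] = field(default_factory=list)
    conclusion: Optional[str] = None


Tool = Callable[[], float]
# A planner inspects the report and returns the next tool name, or None to stop.
Planner = Callable[[Report, Dict[str, Tool]], Optional[str]]


def run_investigation(question: str, tools: Dict[str, Tool],
                      planner: Planner, max_steps: int = 10) -> Report:
    report = Report(question)
    for _ in range(max_steps):
        choice = planner(report, tools)   # stand-in for the LLM's decision
        if choice is None:                # planner judges the evidence sufficient
            break
        result = tools[choice]()          # execute the selected experiment
        report.findings.append(f"{choice}: {result:.2f}")
    # Stand-in for the final refinement step of the report.
    report.conclusion = (
        f"Ran {len(report.findings)} experiment(s) for: {question}"
    )
    return report


def toy_planner(report: Report, tools: Dict[str, Tool]) -> Optional[str]:
    """Toy policy: run each tool once, in order, then stop."""
    done = {finding.split(":")[0] for finding in report.findings}
    remaining = [name for name in tools if name not in done]
    return remaining[0] if remaining else None


# Placeholder tools returning fixed, invented scores.
toy_tools: Dict[str, Tool] = {
    "accuracy_on_clean_images": lambda: 0.90,
    "accuracy_on_blurred_images": lambda: 0.70,
}

report = run_investigation("Is the model robust to blur?", toy_tools, toy_planner)
print(report.findings)
print(report.conclusion)
```

In this sketch the report is the only state passed to the planner, which mirrors the abstract's statement that the report drives the testing procedure. A real system would replace toy_planner with an LLM call and the lambdas with actual evaluation tools.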
Taxonomy
Topics: Advanced Multi-Objective Optimization Algorithms · Control Systems and Identification
Methods: Sparse Evolutionary Training · Lib
