Automatic benchmarking of large multimodal models via iterative experiment programming

Alessandro Conti; Enrico Fini; Paolo Rota; Yiming Wang; Massimiliano Mancini; Elisa Ricci

arXiv:2406.12321·cs.AI·June 19, 2024

TL;DR

APEx automates the benchmarking of large multimodal models by using a large language model to generate, execute, and refine experiments based on natural language research questions, reducing manual effort and increasing flexibility.

Contribution

This paper introduces APEx, the first framework for automatic benchmarking of large multimodal models using iterative experiment programming with LLMs.

Findings

1. APEx successfully reproduces the findings of existing studies.

2. It enables arbitrary analyses and hypothesis testing.

3. The framework is modular and extensible.

Abstract

Assessing the capabilities of large multimodal models (LMMs) often requires the creation of ad-hoc evaluations. Currently, building new benchmarks requires tremendous amounts of manual work for each specific analysis. This makes the evaluation process tedious and costly. In this paper, we present APEx, Automatic Programming of Experiments, the first framework for automatic benchmarking of LMMs. Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand, and progressively compile a scientific report. The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions. Finally, the LLM refines the report, presenting the results to the…
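The report-driven loop described in the abstract can be illustrated with a minimal sketch. This is not the APEx implementation: every name here (`Report`, `run_investigation`, `toy_planner`, the tool names, and the scores they return) is hypothetical, and a deterministic stub stands in for the LLM that would decide which experiment to run next and when the evidence is sufficient.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Report:
    """Running record of the investigation (hypothetical structure)."""
    question: str
    findings: List[str] = field(default_factory=list)


# Hypothetical tool library: experiment name -> callable returning a score.
Tools = Dict[str, Callable[[], float]]
# Planner: given the report so far, return the next experiment name,
# or None when the results are judged sufficient.
Planner = Callable[[Report, Tools], Optional[str]]


def run_investigation(question: str, tools: Tools,
                      choose_next: Planner, max_steps: int = 5) -> Report:
    report = Report(question)
    for _ in range(max_steps):
        name = choose_next(report, tools)  # an LLM call in a real system
        if name is None:                   # planner decided to stop
            break
        score = tools[name]()              # execute the chosen experiment
        report.findings.append(f"{name}: {score:.2f}")
    return report


def toy_planner(report: Report, tools: Tools) -> Optional[str]:
    """Stub standing in for the LLM: run each tool once, then stop."""
    done = {finding.split(":")[0] for finding in report.findings}
    remaining = [name for name in tools if name not in done]
    return remaining[0] if remaining else None


# Made-up experiments with made-up scores, for illustration only.
tools: Tools = {
    "accuracy_on_clean": lambda: 0.9,
    "accuracy_on_rotated": lambda: 0.6,
}
report = run_investigation("Is the model robust to rotation?",
                           tools, toy_planner)
print(report.findings)
```

The point of the sketch is the control flow: the report accumulates results, and the planner reads the report to choose the next experiment or to stop. In the framework the abstract describes, that planning role is played by an LLM rather than a fixed rule.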

Code & Models

Repositories

altndrr/apex (PyTorch, official)

Taxonomy

Topics: Advanced Multi-Objective Optimization Algorithms · Control Systems and Identification

Methods: Sparse Evolutionary Training · Lib