m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

Zixian Ma; Weikai Huang; Jieyu Zhang; Tanmay Gupta; Ranjay Krishna

arXiv:2403.11085 · cs.CV · September 24, 2024 · 2 cites

TL;DR

This paper introduces m&m's, a benchmark of over 4,000 multi-step multi-modal tasks involving 33 tools, for evaluating and improving LLM-based planners on complex real-world problems.

Contribution

It provides a large, annotated dataset and a systematic evaluation framework for assessing LLMs as multi-step multi-modal task planners, addressing the lack of standardized benchmarks for such planners.

Findings

01

Multi-step planning (generating the full plan in a single shot) generally outperforms step-by-step planning.

02

Structured data formats like JSON improve plan clarity and execution.

03

Feedback mechanisms enhance planning accuracy.
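
To make the JSON-format and feedback findings concrete, here is a minimal sketch of a plan written as JSON and a checker that turns parsing and verification problems into feedback messages. The tool registry, the `id`/`name`/`args` step layout, the `<node-k>` reference syntax, and the `verify_plan` helper are all illustrative assumptions, not the benchmark's actual schema or code.

```python
import json
import re

# Hypothetical tool registry: tool name -> required argument names.
# The real benchmark defines 33 tools; these two are placeholders.
TOOLS = {
    "image captioning": {"image"},
    "text summarization": {"text"},
}

# Assumed syntax for referring to the output of an earlier step.
NODE_REF = re.compile(r"<node-(\d+)>")


def verify_plan(plan_json):
    """Return a list of feedback messages; an empty list means the plan passed."""
    try:
        plan = json.loads(plan_json)
    except json.JSONDecodeError as exc:
        return [f"parsing error: {exc.msg}"]
    if not isinstance(plan, list):
        return ["parsing error: plan must be a JSON list of steps"]

    errors = []
    for position, step in enumerate(plan):
        if not isinstance(step, dict):
            errors.append(f"step {position}: must be a JSON object")
            continue
        name = step.get("name")
        args = step.get("args", {})
        if name not in TOOLS:
            errors.append(f"step {position}: unknown tool {name!r}")
            continue
        if set(args) != TOOLS[name]:
            errors.append(
                f"step {position}: expected args {sorted(TOOLS[name])}, "
                f"got {sorted(args)}"
            )
        # A step may only consume outputs of steps that come before it.
        for value in args.values():
            for ref in NODE_REF.findall(str(value)):
                if int(ref) >= position:
                    errors.append(
                        f"step {position}: refers to node {ref}, "
                        "which has not run yet"
                    )
    return errors


good_plan = json.dumps([
    {"id": 0, "name": "image captioning", "args": {"image": "photo.jpg"}},
    {"id": 1, "name": "text summarization", "args": {"text": "<node-0>.text"}},
])
bad_plan = json.dumps([
    {"id": 0, "name": "image captioning", "args": {"image": "photo.jpg"}},
    {"id": 1, "name": "text translation", "args": {"text": "<node-0>.text"}},
])
```

In a feedback loop of this kind, the messages returned for `bad_plan` would be appended to the prompt so the planner can revise its output; how the benchmark itself phrases and delivers feedback is not specified here.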

Abstract

Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we…

Code & Models

Datasets

zixianma/mnms
dataset · 170 dl

Taxonomy

Topics: Speech and dialogue systems