MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark

Elliot L. Epstein; Kaisheng Yao; Jing Li; Xinyi Bai; Hamid Palangi

arXiv:2409.18216·cs.AI·September 30, 2024

TL;DR

This paper introduces MMMT-IF, a challenging multimodal, multi-turn instruction-following benchmark with an objective, code-verifiable evaluation metric. The benchmark reveals current models' limitations in retrieving and adhering to instructions over long dialogues.

Contribution

The paper presents MMMT-IF, a novel multimodal multi-turn instruction following benchmark with a new metric, PIF, for objective evaluation of instruction adherence in complex dialogue settings.

Findings

01

Models' instruction following drops significantly over turns.

02

Appending instructions to input improves performance.

03

PIF correlates well with human ratings.

Abstract

Evaluating instruction-following capabilities for multimodal, multi-turn dialogue is challenging. With potentially multiple instructions in the input model context, the task is time-consuming for human raters, and we show LLM-based judges are biased towards answers from the same model. We propose MMMT-IF, an image-based multi-turn Q&A evaluation set with added global instructions between questions, constraining the answer format. This challenges models to retrieve instructions dispersed across long dialogues and reason under instruction constraints. All instructions are objectively verifiable through code execution. We introduce the Programmatic Instruction Following (PIF) metric to measure the fraction of the instructions that are correctly followed while performing a reasoning task. The PIF-N-K set of metrics further evaluates robustness by…
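The abstract describes PIF as the fraction of instructions that an answer follows, with each instruction checked by code. A minimal Python sketch of that idea is shown here. The checker functions (starts_with, ends_with, max_words) and the example instructions are hypothetical illustrations, not the instruction set or the implementation used in the paper.

```python
from typing import Callable, List

# A checker takes a model answer and returns True if the instruction is followed.
Checker = Callable[[str], bool]


def starts_with(prefix: str) -> Checker:
    """Hypothetical instruction: the answer must start with `prefix`."""
    return lambda answer: answer.startswith(prefix)


def ends_with(suffix: str) -> Checker:
    """Hypothetical instruction: the answer must end with `suffix`."""
    return lambda answer: answer.endswith(suffix)


def max_words(limit: int) -> Checker:
    """Hypothetical instruction: the answer may contain at most `limit` words."""
    return lambda answer: len(answer.split()) <= limit


def pif(answer: str, checkers: List[Checker]) -> float:
    """Fraction of the given instructions that `answer` follows."""
    if not checkers:
        raise ValueError("at least one instruction is required")
    followed = sum(1 for check in checkers if check(answer))
    return followed / len(checkers)


# Illustrative data only: three global instructions active at some turn.
example_answer = "Answer: The chart shows two peaks."
example_checkers = [starts_with("Answer:"), ends_with("."), max_words(5)]
example_score = pif(example_answer, example_checkers)  # 2 of 3 followed
```

In this sketch the answer satisfies the prefix and suffix checks but has six words, so the score is 2/3. Because each check is a plain function, the score needs no human rater or LLM judge, which is the property the abstract emphasizes.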

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training