MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark
Elliot L. Epstein, Kaisheng Yao, Jing Li, Xinyi Bai, Hamid Palangi

TL;DR
This paper introduces MMMT-IF, a challenging multimodal multi-turn instruction following benchmark with an objective, code-verifiable evaluation metric, revealing current models' limitations in instruction retrieval and adherence over long dialogues.
Contribution
The paper presents MMMT-IF, a novel multimodal multi-turn instruction following benchmark with a new metric, PIF, for objective evaluation of instruction adherence in complex dialogue settings.
Findings
Models' instruction-following performance drops significantly as the number of turns grows.
Appending the instructions to the model input improves performance.
PIF correlates well with human ratings.
Abstract
Evaluating instruction-following capabilities for multimodal, multi-turn dialogue is challenging. With potentially multiple instructions in the input model context, the task is time-consuming for human raters, and we show that LLM-based judges are biased towards answers from the same model. We propose MMMT-IF, an image-based multi-turn QA evaluation set with added global instructions between questions, constraining the answer format. This challenges models to retrieve instructions dispersed across long dialogues and reason under instruction constraints. All instructions are objectively verifiable through code execution. We introduce the Programmatic Instruction Following (PIF) metric to measure the fraction of the instructions that are correctly followed while performing a reasoning task. The set of metrics further evaluates robustness by…
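As a rough illustration of what "the fraction of the instructions that are correctly followed" could look like when each instruction is checked by code, here is a minimal Python sketch. It is not the authors' implementation: the helper names (`starts_with_letter`, `max_words`, `pif`) and the example instructions are hypothetical, chosen only to show how code-verifiable format constraints can be scored.

```python
from typing import Callable, List

# A check maps a model response to True (instruction followed) or False.
Check = Callable[[str], bool]


def starts_with_letter(letter: str) -> Check:
    """Hypothetical instruction: the response must start with `letter`."""
    return lambda response: response.strip().lower().startswith(letter.lower())


def max_words(limit: int) -> Check:
    """Hypothetical instruction: the response may use at most `limit` words."""
    return lambda response: len(response.split()) <= limit


def ends_with_exclamation(response: str) -> bool:
    """Hypothetical instruction: the response must end with '!'."""
    return response.strip().endswith("!")


def pif(response: str, checks: List[Check]) -> float:
    """Fraction of the given instruction checks that the response satisfies."""
    if not checks:
        raise ValueError("at least one instruction check is required")
    followed = sum(1 for check in checks if check(response))
    return followed / len(checks)


# Three instructions assumed to have accumulated over earlier turns.
checks = [starts_with_letter("s"), max_words(10), ends_with_exclamation]

# Follows two of the three instructions (it does not end with '!').
partial_score = pif("Seven birds sit on the wire.", checks)

# Follows all three instructions.
full_score = pif("Seven birds!", checks)
```

Because every check is a deterministic function of the response text, a score of this kind needs neither human raters nor an LLM judge, which is the property the abstract emphasizes.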
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
