TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts

Yuxuan Xie; Tianhua Li; Wenqi Shao; Kaipeng Zhang

arXiv:2410.18071·cs.CV·October 24, 2024

TL;DR

This paper introduces TP-Eval, a novel evaluation framework for multimodal large language models that customizes prompts to reduce bias and better reveal models' true capabilities.

Contribution

The paper proposes a prompt customization method for MLLM evaluation, addressing prompt sensitivity and bias issues in existing benchmarks.

Findings

1. Prompt customization improves evaluation accuracy.

2. TP-Eval uncovers model capabilities that the original benchmark prompts underestimate.

3. Customized prompts reduce evaluation bias across models.

Abstract

Recently, multimodal large language models (MLLMs) have received much attention for their impressive capabilities. Evaluating MLLMs is becoming critical for analyzing their attributes and providing valuable insights. However, current benchmarks overlook the problem of prompt sensitivity: minor prompt variations may lead to significant performance fluctuations. Inappropriate prompts may therefore obscure a model's capabilities and underestimate its performance. Moreover, different models prefer different prompts, so using the same prompt for all models causes evaluation bias. This paper analyzes this deficiency in existing benchmarks and introduces a new evaluation framework named TP-Eval, which uses a prompt customization method to reduce evaluation biases and tap models' potential. TP-Eval will rewrite the original prompts…
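The idea of per-model prompt customization can be illustrated with a small sketch. This is not the TP-Eval algorithm itself (the abstract above is truncated before describing how prompts are rewritten); it only shows, under stated assumptions, why a single shared prompt can bias a comparison. The model names, prompt labels, accuracy numbers, and the helpers `score`, `customize_prompt`, and `evaluate` are all hypothetical and invented for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical accuracy table: TOY_ACCURACY[model][prompt] on a small dev set.
# These numbers are invented for illustration; they are NOT results from the paper.
TOY_ACCURACY: Dict[str, Dict[str, float]] = {
    "model_a": {"original": 0.50, "rephrase_1": 0.70, "rephrase_2": 0.55},
    "model_b": {"original": 0.60, "rephrase_1": 0.45, "rephrase_2": 0.65},
}


def score(model: str, prompt: str) -> float:
    """Stand-in for running a model on a benchmark with a given prompt."""
    return TOY_ACCURACY[model][prompt]


def customize_prompt(
    model: str, candidates: List[str], score_fn: Callable[[str, str], float]
) -> str:
    """Pick the candidate prompt on which this particular model scores highest."""
    return max(candidates, key=lambda p: score_fn(model, p))


def evaluate(
    models: List[str],
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    shared_prompt: str,
) -> Dict[str, Dict[str, object]]:
    """Compare each model's score under one shared prompt vs. its own best prompt."""
    report: Dict[str, Dict[str, object]] = {}
    for m in models:
        best = customize_prompt(m, candidates, score_fn)
        report[m] = {
            "shared_prompt_score": score_fn(m, shared_prompt),
            "best_prompt": best,
            "customized_score": score_fn(m, best),
        }
    return report


prompts = ["original", "rephrase_1", "rephrase_2"]
report = evaluate(["model_a", "model_b"], prompts, score, shared_prompt="original")
for name, row in report.items():
    print(name, row)
```

In this toy table, the shared "original" prompt ranks model_b above model_a, while each model's own best prompt reverses the ranking. That reversal is the kind of evaluation bias the abstract attributes to using one prompt for all models.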

Taxonomy

Topics: Advanced Text Analysis Techniques · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need