LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model

Tao Sun; Oliver Liu; JinJin Li; Lan Ma

arXiv:2508.05602 · cs.CV · August 8, 2025

TL;DR

LLaVA-RE is a framework built on a multimodal large language model (MLLM) for binary image-text relevancy evaluation. It is designed to handle diverse text formats and definitions of relevancy that vary across scenarios, and the authors report experimental results in support of it.

Contribution

LLaVA-RE is presented as a first attempt to use MLLMs for binary image-text relevancy evaluation, combining detailed task instructions with a new, diverse dataset.

Findings

01. Effective in handling complex text formats
02. Accurate binary relevancy classification
03. Validated by comprehensive experiments

Abstract

Multimodal generative AI usually involves generating image or text responses given inputs in another modality. The evaluation of image-text relevancy is essential for measuring response quality or ranking candidate responses. In particular, binary relevancy evaluation, i.e., "Relevant" vs. "Not Relevant", is a fundamental problem. However, this is a challenging task, since texts come in diverse formats and the definition of relevancy varies across scenarios. We find that Multimodal Large Language Models (MLLMs) are an ideal choice for building such evaluators, as they can flexibly handle complex text formats and take in additional task information. In this paper, we present LLaVA-RE, a first attempt at binary image-text relevancy evaluation with an MLLM. It follows the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. In addition, we…
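The abstract describes an evaluator driven by a detailed task instruction plus multimodal in-context samples, returning one of two labels. The following is a minimal sketch of how such a query could be assembled and its answer parsed. It is not the authors' implementation: the `Example` type, the `build_prompt` and `parse_label` helpers, the prompt layout, and the image placeholder strings are all hypothetical, and the call to the actual model is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One in-context sample (hypothetical structure, not from the paper)."""
    image_ref: str  # placeholder for an image, e.g. a path or an image token
    text: str       # the text whose relevancy to the image is judged
    label: str      # "Relevant" or "Not Relevant"


def build_prompt(instruction: str, examples: List[Example],
                 image_ref: str, text: str) -> str:
    """Concatenate the task instruction, the in-context samples, and the query.

    The layout below is an assumed format chosen for illustration only.
    """
    parts = [instruction.strip()]
    for ex in examples:
        parts.append(f"Image: {ex.image_ref}\nText: {ex.text}\nAnswer: {ex.label}")
    # The query ends with an open "Answer:" for the model to complete.
    parts.append(f"Image: {image_ref}\nText: {text}\nAnswer:")
    return "\n\n".join(parts)


def parse_label(response: str) -> Optional[str]:
    """Map a free-form model response onto one of the two binary labels.

    Matches on the leading words, so trailing punctuation or explanation
    after the label is tolerated. Returns None if no label is recognized.
    """
    cleaned = response.strip().lower()
    if cleaned.startswith("not relevant"):
        return "Not Relevant"
    if cleaned.startswith("relevant"):
        return "Relevant"
    return None


if __name__ == "__main__":
    instruction = (
        "Decide whether the text is relevant to the image. "
        "Answer with 'Relevant' or 'Not Relevant'."
    )
    shots = [Example("<image_1>", "A dog running on a beach.", "Relevant")]
    print(build_prompt(instruction, shots, "<image_2>", "A bowl of fruit."))
    print(parse_label("Not Relevant. The image shows a car."))
```

Keeping the instruction as a separate argument reflects the point made in the abstract that the definition of relevancy varies by scenario: under this sketch, a different scenario would swap the instruction and in-context samples without changing the parsing step.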
