LLaVA-Critic: Learning to Evaluate Multimodal Models

Tianyi Xiong; Xiyao Wang; Dong Guo; Qinghao Ye; Haoqi Fan; Quanquan Gu; Heng Huang; Chunyuan Li

arXiv:2410.02712 · cs.CV · March 5, 2025 · 2 cites

TL;DR

LLaVA-Critic is an open-source multimodal model that evaluates and guides the performance of other multimodal models, achieving competitive evaluation accuracy and improving model alignment through preference learning.

Contribution

It introduces the first open-source multimodal evaluator capable of assessing diverse tasks and providing reward signals for preference learning.

Findings

01. Performs on par with or better than GPT models on evaluation benchmarks.

02. Effectively generates reward signals for model alignment.

03. Demonstrates the potential of open-source LMMs in self-critique.

Abstract

We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.
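
As a rough illustration of the second use case, the sketch below shows one common way scalar critic scores can be turned into a chosen/rejected pair for preference learning. It is not taken from the paper: `build_preference_pair` and the `critic_score` callable are hypothetical names, the image input is omitted for brevity, and the best-versus-worst selection rule is an assumption.

```python
from typing import Callable, Optional, Sequence


def build_preference_pair(
    question: str,
    candidates: Sequence[str],
    critic_score: Callable[[str, str], float],
) -> Optional[dict]:
    """Turn critic scores into one preference record, or None on a full tie.

    critic_score is a hypothetical stand-in for a call to an evaluator model
    that returns a scalar score for (question, response).
    """
    # Score every candidate response with the critic.
    scored = [(critic_score(question, c), c) for c in candidates]
    best = max(scored, key=lambda item: item[0])
    worst = min(scored, key=lambda item: item[0])
    # If all candidates receive the same score there is no usable signal.
    if best[0] == worst[0]:
        return None
    return {"prompt": question, "chosen": best[1], "rejected": worst[1]}
```

Records of this shape (prompt, chosen, rejected) are the usual input to preference-optimization methods; whether the authors use this exact construction is not stated in the abstract.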

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. Taking LMM as a critic is a very important research direction. It can be used to further fine-tune the LMM.
2. The authors open source the critic instruction data, codebase, and model checkpoints, which is a big contribution to the community.

Weaknesses

From the methodology perspective, I do not get much insight. It seems this is engineering work, gathering the large-scale dataset and training a large model with lots of resources. I would give an accept if this is a benchmark setting. But for the main track, the novelty is not sufficient.

Reviewer 02 · Rating 5 · Confidence 4

Strengths

The presentation of the paper is good, with the authors being clear and detailed about the procedure followed. For example, the references to previous datasets and prompts are either well cited in the main paper or detailed in the appendix. Figure 1 also helps in understanding what constitutes the majority of the training dataset. The paper mentions that their method is open-source, which indicates that the weights will be released - a net positive for the community. Most of the claims in the empiri

Weaknesses

The strongest concern about the paper is that there is nothing that is particularly surprising or insightful. In essence, the procedure described amounts to distilling GPT-4o into an open-source model using standard cross-entropy, and showing performance that is close to that of GPT-4o. Although this is great for the community, I am struggling to see what the contribution is from a scientific perspective. I can see how this paper would be valuable within a Datasets & Benchmarks track as the one

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The results seem strong: good performance on a series of benchmarks. The paper is written clearly, explaining the data collection process well.

Weaknesses

The paper could benefit from some ablation studies to offer insights. For example, it was mentioned "We randomly select 20k pairs where the average score gap between responses is greater than 0.6. Besides, to ensure diversity in the preferences, we randomly sample 5k pairs where the two responses had identical scores across all three dimensions to serve as “Tie” training data." Why was this decision made? What are their effects on the final model artifacts? Table 4 and Table 5 might benefit fr
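
The selection rule quoted in this review can be sketched as follows. This is a minimal illustration, not the authors' code: the field names `scores_a` and `scores_b` are hypothetical, and "average score gap" is read here as the absolute difference between the two responses' mean scores over the three dimensions.

```python
import random
from statistics import mean


def select_pairs(pairs, n_pref=20_000, n_tie=5_000, min_gap=0.6, seed=0):
    """Split scored response pairs into preference and "Tie" training sets.

    Each pair is a dict with hypothetical keys "scores_a" and "scores_b",
    each holding the three per-dimension scores of one response.
    """
    rng = random.Random(seed)
    # Preference candidates: gap between mean scores strictly above min_gap.
    pref_pool = [
        p for p in pairs
        if abs(mean(p["scores_a"]) - mean(p["scores_b"])) > min_gap
    ]
    # Tie candidates: identical scores on every dimension.
    tie_pool = [p for p in pairs if list(p["scores_a"]) == list(p["scores_b"])]
    # Random sampling without replacement, capped at the pool size.
    pref = rng.sample(pref_pool, min(n_pref, len(pref_pool)))
    tie = rng.sample(tie_pool, min(n_tie, len(tie_pool)))
    return pref, tie
```

An ablation of the kind the reviewer asks for would vary `min_gap` and the ratio of `n_tie` to `n_pref` and retrain on each resulting split.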

Reviewer 04 · Rating 3 · Confidence 3

Strengths

Enhancing large models' evaluation capabilities in multimodal scenarios is the central focus of this paper. This paper improves the evaluation ability of an existing large model by fine-tuning it on a high-quality evaluation dataset, resulting in a new multimodal large language model. The paper is well-written, with smooth flow and well-supported arguments, making it easy to follow.

Weaknesses

However, this paper is not sufficiently innovative. It only releases the relevant datasets and uses them to fine-tune a new multimodal large model, without any new methodological contribution. The experimental section also lacks comprehensive baselines. The differences in pointwise scoring and pairwise ranking capabilities between LLaVA-Critic and existing multimodal large language models are not compared, as LLaVA itself is not currently the most capable multimodal model. Further expe

Code & Models

Models

Datasets

lmms-lab/llava-critic-113k
dataset · 2.3k downloads

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Multi-Agent Systems and Negotiation

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Weight Decay · Cosine Annealing · Dropout · Byte Pair Encoding · Softmax · Attention Dropout