MINOS: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text
Junzhe Zhang, Huixuan Zhang, Xinyu Hu, Li Lin, Mingqi Gao, Shi Qiu, Xiaojun Wan

TL;DR
This paper introduces MINOS, a multimodal evaluation model trained on a high-quality dataset, achieving state-of-the-art performance in evaluating image-text generation tasks across diverse datasets.
Contribution
The paper presents Minos-57K, a comprehensive, quality-controlled evaluation dataset, and a new evaluation model that outperforms existing open-source models with less training data.
Findings
MINOS achieves state-of-the-art performance across 16 out-of-domain datasets.
Quality control and joint training on I2T and T2I data are crucial for evaluation accuracy.
The model remains competitive with closed-source evaluation models.
Abstract
Evaluation is important for multimodal generation tasks, yet traditional multimodal evaluation metrics suffer from several limitations. With the rapid progress of MLLMs, there is growing interest in applying MLLMs to build general evaluation systems. However, existing research often simply collects large-scale evaluation data for training while overlooking the quality of that data. Moreover, currently proposed evaluation models often struggle to achieve consistently strong performance across both image-to-text (I2T) and text-to-image (T2I) tasks. In this paper, through rigorous quality control strategies, we construct a comprehensive multimodal evaluation dataset, Minos-57K, with evaluation samples across 15 datasets, and use it to develop the multimodal evaluation model Minos with SFT and preference alignment training strategies. Notably, despite using less than half the scale of…
