Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

Piotr Padlewski; Max Bain; Matthew Henderson; Zhongkai Zhu; Nishant Relan; Hai Pham; Donovan Ong; Kaloyan Aleksiev; Aitor Ormazabal; Samuel Phua; Ethan Yeo; Eugenie Lamprecht; Qi Liu; Yuqi Wang; Eric Chen; Deyu Fu; Lei Li; Che Zheng; Cyprien de Masson d'Autume; Dani Yogatama; Mikel Artetxe; Yi Tay

arXiv:2405.02287·cs.CL·May 6, 2024

TL;DR

Vibe-Eval is an open benchmark of 269 visual understanding prompts, 100 of them hard, with expert-written gold-standard responses. It is designed to evaluate multimodal chat models rigorously and expose their limitations.

Contribution

This paper introduces Vibe-Eval, a novel open benchmark with hard prompts and expert responses to evaluate and rank the capabilities of multimodal language models.

Findings

01

More than 50% of the questions in the hard set are answered incorrectly by all frontier models

02

Automatic evaluation using Reka Core correlates roughly with human judgment

03

Vibe-Eval provides a challenging and open-ended assessment framework

Abstract

We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. Vibe-Eval is open-ended and challenging with dual objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) rigorously testing and probing the capabilities of present frontier models. Notably, our hard set contains >50% questions that all frontier models answer incorrectly. We explore the nuances of designing, evaluating, and ranking models on ultra challenging prompts. We also discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core roughly correlates to human judgment. We offer free API access for the purpose of lightweight evaluation and plan to conduct…
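The abstract's claim that automatic evaluation "roughly correlates" with human judgment can be made concrete with a rank correlation. The following is a minimal sketch, not the authors' method: it assumes each response receives one numeric score from the automatic judge and one from a human rater, and it uses Spearman's rank correlation as one plausible measure. The function names and the sample scores are hypothetical and are not taken from the paper or its repository.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-response scores (illustrative only, not real data).
judge_scores = [4, 2, 5, 1, 3, 3]
human_scores = [5, 2, 4, 1, 3, 2]
rho = spearman(judge_scores, human_scores)
print(f"Spearman correlation between judge and human scores: {rho:.2f}")
```

A value near 1 would indicate that the automatic judge ranks responses much as humans do. A rank-based measure is used here because it does not require the two raters to share the same scale.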

Code & Models

Repositories

reka-ai/reka-vibe-eval
Official

Datasets

RekaAI/VibeEval
dataset · 622 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Sparse Evolutionary Training