MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Shieh

TL;DR
MixEval-X is a comprehensive, standardized benchmark for evaluating AI models across diverse modalities, addressing current inconsistencies and biases to better reflect real-world performance.
Contribution
It introduces the first any-to-any real-world multi-modal benchmark with adaptation pipelines, improving evaluation reliability and correlation with real-world outcomes.
Findings
Model rankings correlate strongly (up to 0.98) with crowd-sourced real-world evaluations.
Effective alignment of benchmark samples with real-world task distributions.
Enhanced evaluation efficiency and standardization.
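To make the correlation finding concrete, the sketch below shows one common way to compare a benchmark's model ranking against a crowd-sourced ranking. It is only an illustration: the summary does not say which correlation coefficient the authors report, so Spearman rank correlation is an assumption here, and the model names, scores, and ratings are invented placeholders rather than results from the paper.

```python
# Minimal sketch: rank correlation between benchmark scores and crowd-sourced
# ratings. Spearman's rho is an ASSUMED choice of metric; all data is made up.

def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over any run of equal values (a tie group).
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical benchmark scores and crowd-sourced ratings for four models.
benchmark_scores = {"model_a": 71.2, "model_b": 64.5, "model_c": 58.9, "model_d": 49.3}
crowd_ratings = {"model_a": 1210, "model_b": 1150, "model_c": 1165, "model_d": 1040}

models = sorted(benchmark_scores)
rho = spearman(
    [benchmark_scores[m] for m in models],
    [crowd_ratings[m] for m in models],
)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 means the benchmark orders models almost the same way as the crowd-sourced evaluation; in this toy data two adjacent models swap places, which lowers the coefficient.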
Abstract
Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task…
Taxonomy
Topics: Simulation Techniques and Applications · Time Series Analysis and Forecasting
