MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding
Mohan Jiang, Jin Gao, Jiahao Zhan, Dequan Wang

TL;DR
The paper introduces MAC, a dynamic benchmark for evaluating multimodal large language models' scientific understanding, highlighting current limitations and proposing a new inference method to improve reasoning capabilities.
Contribution
The paper presents MAC, a live, evolving benchmark for scientific understanding in MLLMs, and introduces DAD, a novel inference-time technique that enhances reasoning performance.
Findings
MLLMs show strong perceptual abilities but limited scientific reasoning.
DAD improves MLLM reasoning performance by up to 11%.
MAC remains adaptable to scientific progress and model updates.
Abstract
As multimodal large language models (MLLMs) grow increasingly capable, fixed benchmarks are gradually losing their effectiveness in evaluating high-level scientific understanding. In this paper, we introduce the Multimodal Academic Cover benchmark (MAC), a live benchmark that can continuously evolve with scientific advancement and model progress. MAC leverages over 25,000 image-text pairs sourced from issues of top-tier scientific journals such as Nature, Science, and Cell, challenging MLLMs to reason across abstract visual and textual scientific content. Experiments on our most recent yearly snapshot, MAC-2025, reveal that while MLLMs demonstrate strong perceptual abilities, their cross-modal scientific reasoning remains limited. To bridge this gap, we propose DAD, a lightweight inference-time approach that enhances MLLMs by extending MLLM visual features with language space…
