MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding

Mohan Jiang; Jin Gao; Jiahao Zhan; Dequan Wang

arXiv:2508.15802·cs.CL·August 25, 2025

TL;DR

The paper introduces MAC, a live benchmark that tests multimodal large language models' scientific understanding on image-text pairs drawn from issues of top-tier journals. It finds that current models perceive well but reason poorly across modalities, and proposes DAD, a lightweight inference-time method, to improve reasoning.

Contribution

The paper presents MAC, a live, evolving benchmark for scientific understanding in MLLMs, and introduces DAD, a lightweight inference-time technique for improving reasoning performance.

Findings

01

MLLMs show strong perceptual abilities but limited scientific reasoning.

02

DAD improves MLLM reasoning performance by up to 11%.

03

MAC remains adaptable to scientific progress and model updates.

Abstract

As multimodal large language models (MLLMs) grow increasingly capable, fixed benchmarks are gradually losing their effectiveness in evaluating high-level scientific understanding. In this paper, we introduce the Multimodal Academic Cover benchmark (MAC), a live benchmark that could continuously evolve with scientific advancement and model progress. MAC leverages over 25,000 image-text pairs sourced from issues of top-tier scientific journals such as Nature, Science, and Cell, challenging MLLMs to reason across abstract visual and textual scientific content. Experiments on our most recent yearly snapshot, MAC-2025, reveal that while MLLMs demonstrate strong perceptual abilities, their cross-modal scientific reasoning remains limited. To bridge this gap, we propose DAD, a lightweight inference-time approach that enhances MLLMs by extending MLLM visual features with language space…
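The abstract describes MAC as a live benchmark evaluated through yearly snapshots such as MAC-2025. As a rough illustration of that idea only, the sketch below selects one year's worth of image-text pairs from a larger pool. The `CoverPair` record, its field names, the helper functions, and the sample entries are all hypothetical; none of them is taken from the paper or from the released dataset.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CoverPair:
    """One image-text pair. Every field name here is an assumption."""
    journal: str
    issue_date: date
    image_path: str
    cover_story: str


def yearly_snapshot(pairs, year):
    """Return the pairs whose issue date falls in `year`, oldest first."""
    selected = [p for p in pairs if p.issue_date.year == year]
    return sorted(selected, key=lambda p: p.issue_date)


def snapshot_name(year):
    """Build a snapshot label in the style the abstract uses (MAC-2025)."""
    return f"MAC-{year}"


# Placeholder records for illustration; these are not real benchmark entries.
pairs = [
    CoverPair("Nature", date(2024, 11, 7), "covers/a.png", "placeholder text A"),
    CoverPair("Cell", date(2025, 3, 6), "covers/b.png", "placeholder text B"),
    CoverPair("Science", date(2025, 1, 10), "covers/c.png", "placeholder text C"),
]

snapshot = yearly_snapshot(pairs, 2025)
print(snapshot_name(2025), [p.journal for p in snapshot])
```

Under this assumed layout, producing a later snapshot only requires appending newer issues to the pool and calling `yearly_snapshot` with the new year, which is one plausible reading of how a benchmark could keep pace with new publications.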

Code & Models

Datasets

mhjiang0408/MAC_Bench
dataset · 35 downloads
