MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence
Yifan Chen, Fei Yin, Qingyan Bai, Zicheng Lin, Yujiu Yang

TL;DR
MMCL-Bench is a benchmark for multimodal context learning. It requires models to recover visual evidence and reason over it across diverse tasks, and it reveals significant gaps in the capabilities of current systems.
Contribution
This paper introduces MMCL-Bench, a 102-task benchmark for multimodal context learning from visual or mixed-modality teaching context, and uses it to document the limitations of current models.
Findings
Even the strongest evaluated model solves fewer than one-third of tasks under strict evaluation.
Failures occur across context anchoring, evidence extraction, reasoning, and response construction.
MMCL-Bench exposes critical bottlenecks in multimodal context learning capabilities.
Abstract
We introduce MMCL-Bench, a benchmark for multimodal context learning: learning task-local rules, procedures, and empirical patterns from visual or mixed-modality teaching context and applying them to new visual instances. Unlike text-only context learning or standard multimodal question answering, this setting requires models to recover and localize relevant evidence from images, screenshots, manuals, videos, and frame sequences before they can reason over the learned context. MMCL-Bench contains 102 tasks spanning three categories: rule system application, procedural task execution, and empirical discovery and induction. We evaluate frontier multimodal models with strict rubric-based scoring and find that current systems remain far from robust multimodal context learning, with even the strongest model solving fewer than one-third of tasks under strict evaluation. Diagnostic ablations…
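The abstract mentions strict rubric-based scoring without defining it here. As a rough illustration only, one common reading of "strict" is all-or-nothing: a task counts as solved only when every rubric item passes. The sketch below assumes that reading; `TaskResult`, `rubric_passed`, `strict_solved`, and `strict_solve_rate` are hypothetical names, not part of MMCL-Bench.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one model response graded against a rubric."""
    task_id: str
    rubric_passed: list[bool]  # one pass/fail flag per rubric item (assumed format)


def strict_solved(result: TaskResult) -> bool:
    # Assumption: "strict" means all-or-nothing. A task with an empty
    # rubric is treated as unsolved rather than trivially passed.
    return bool(result.rubric_passed) and all(result.rubric_passed)


def strict_solve_rate(results: list[TaskResult]) -> float:
    # Fraction of tasks on which every rubric item passed.
    if not results:
        return 0.0
    return sum(strict_solved(r) for r in results) / len(results)


if __name__ == "__main__":
    demo = [
        TaskResult("rules-01", [True, True, True]),
        TaskResult("procedure-01", [True, False, True]),
        TaskResult("induction-01", [False, False]),
    ]
    print(f"strict solve rate: {strict_solve_rate(demo):.2f}")
```

Under this assumed scoring, a response that satisfies most rubric items still counts as a failure, which is one way a headline solve rate can sit well below per-item accuracy.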
