HighlightBench: Benchmarking Markup-Driven Table Reasoning in Scientific Documents
Lexin Wang, Shenghua Liu, Yiwei Wang, Yujun Cai, Yuyao Ge, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

TL;DR
HighlightBench is a new benchmark designed to evaluate how well multimodal language models understand and reason with visual markups in scientific tables, revealing their limitations in handling explicit visual cues.
Contribution
The paper introduces HighlightBench, a diagnostic benchmark with a reference pipeline for detailed evaluation of markup-driven table understanding in multimodal models.
Findings
Strong models show instability when reasoning with visual cues.
The benchmark decomposes evaluation into five task families for fine-grained analysis.
The reference pipeline makes intermediate decisions explicit, enabling reproducible baselines and error attribution.
Abstract
Visual markups such as highlights, underlines, and bold text are common in table-centric documents. Although multimodal large language models (MLLMs) have made substantial progress in document understanding, their ability to treat such cues as explicit logical directives remains under-explored. More importantly, existing evaluations cannot distinguish whether a model fails to see the markup or fails to reason with it. This creates a key blind spot in assessing markup-conditioned behavior over tables. To address this gap, we introduce HighlightBench, a diagnostic benchmark for markup-driven table understanding that decomposes evaluation into five task families: Markup Grounding, Constrained Retrieval, Local Relations, Aggregation & Comparison, and Consistency & Missingness. We further provide a reference pipeline that makes intermediate decisions explicit, enabling reproducible…
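The distinction the abstract draws, between failing to see a markup and failing to reason with it, can be illustrated with a small scoring sketch. This is not the authors' reference pipeline: the record type `ItemResult`, its fields `grounded_ok` and `answer_ok`, and the functions `attribute` and `summarize` are hypothetical names introduced here, and the sketch assumes each benchmark item yields both a grounding judgment and a final-answer judgment. Only the five family names come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# The five task families named in the abstract.
FAMILIES = (
    "Markup Grounding",
    "Constrained Retrieval",
    "Local Relations",
    "Aggregation & Comparison",
    "Consistency & Missingness",
)


@dataclass
class ItemResult:
    """Hypothetical per-item record; the field names are assumptions."""
    family: str
    grounded_ok: bool  # did the model identify the marked cells correctly?
    answer_ok: bool    # was the final answer correct?


def attribute(result: ItemResult) -> str:
    """Label one item as correct, a perception error, or a reasoning error."""
    if result.answer_ok:
        return "correct"
    # Wrong answer: blame perception if the markup was misread,
    # otherwise blame the reasoning step that followed.
    return "reasoning_error" if result.grounded_ok else "perception_error"


def summarize(results):
    """Count outcome labels per task family."""
    counts = {family: defaultdict(int) for family in FAMILIES}
    for r in results:
        counts[r.family][attribute(r)] += 1
    return {family: dict(c) for family, c in counts.items()}
```

Under these assumptions, a family dominated by `perception_error` points to the model not seeing the markup, while one dominated by `reasoning_error` points to the model seeing it but not using it correctly.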
