MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark
Bin Shan, Xiang Fei, Wei Shi, An-Lan Wang, Guozhi Tang, Lei Liao, Jingqun Tang, Xiang Bai, Can Huang

TL;DR
MCTBench is a new benchmark that evaluates the cognitive abilities of multimodal large language models in understanding text-rich visual scenes and creating content from them, addressing a gap left by current perception-focused assessments.
Contribution
This paper introduces MCTBench, a comprehensive benchmark that assesses both perceptual and cognitive skills of MLLMs in text-rich visual scenes, including an automatic evaluation pipeline.
Findings
MLLMs excel in perceptual tasks but need improvement in cognitive abilities.
MCTBench provides a balanced evaluation of perception and cognition.
The benchmark facilitates fair comparison across models.
Abstract
The comprehension of text-rich visual scenes has become a focal point for evaluating Multi-modal Large Language Models (MLLMs) due to their widespread applications. Current benchmarks tailored to the scenario emphasize perceptual capabilities, while overlooking the assessment of cognitive abilities. To address this limitation, we introduce a Multimodal benchmark towards Text-rich visual scenes, to evaluate the Cognitive capabilities of MLLMs through visual reasoning and content-creation tasks (MCTBench). To mitigate potential evaluation bias from the varying distributions of datasets, MCTBench incorporates several perception tasks (e.g., scene text recognition) to ensure a consistent comparison of both the cognitive and perceptual capabilities of MLLMs. To improve the efficiency and fairness of content-creation evaluation, we conduct an automatic evaluation pipeline. Evaluations of…
Peer Reviews
Decision·Submitted to ICLR 2025
1) MCTBench provides a thorough assessment of both reasoning and content-creation capabilities in MLLMs, offering a well-rounded evaluation framework. 2) By using advanced MLLMs for automated evaluation, the benchmark reduces the need for costly and subjective human assessments in content creation tasks. 3) By distinguishing between perception and cognitive tasks, MCTBench helps identify specific areas where MLLMs need improvement. The finding that larger models perform better in cognitive tasks pro
1) Incomplete paper with no content in section 3.3. 2) Segmenting cognitive abilities into reasoning and content generation may not be enough; a sufficiently fine-grained benchmark would require a more precise segmentation of the data. 3) Automated evaluations have improved efficiency, but their accuracy and consistency with manual evaluations need further validation.
1. This paper collects a large-scale benchmark for evaluating the cognitive capabilities of MLLMs, highlighting reasoning and content-creation abilities. 2. For the content-creation task, an automated evaluation pipeline is introduced to enhance efficiency.
1. The paper introduces Content Creation as a new evaluation component, but it could benefit from a clearer explanation of the necessity and value of this addition for assessing cognitive abilities. Furthermore, the rationale behind dividing cognitive tasks into “reasoning” and “content creation” would be strengthened with additional justification for this categorization. 2. The paper suggests that MLLMs require improvements in cognitive capabilities within text-rich visual scenes. However, the
1. This paper broadens the scope of OCR ability evaluation for MLLMs beyond conventional OCR tasks and current MLLM benchmarks. 2. The benchmark is large-scale and human-annotated, making it valid and reliable.
1. Paper is poorly formatted. 2. Paper lacks details. See questions below.
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
