Can MLLMs Understand the Deep Implication Behind Chinese Images?
Chenhao Zhang, Xi Feng, Yuelin Bai, Xinrun Du, Jinchang Hou, Kaixin Deng, Guangzeng Han, Qinrui Li, Bingli Wang, Jiaheng Liu, Xingwei Qu, Yifei Zhang, Qixuan Zhao, Yiming Liang, Ziqiang Liu, Feiteng Fang, Min Yang, Wenhao Huang, Chenghua Lin, Ge Zhang, Shiwen Ni

TL;DR
This paper introduces CII-Bench, a new benchmark for evaluating Multimodal Large Language Models' understanding of Chinese images, especially traditional culture, revealing current limitations and potential improvements.
Contribution
The paper presents CII-Bench, a culturally authentic Chinese image benchmark, and provides extensive evaluation of MLLMs, highlighting their gaps in understanding Chinese high-level semantics and cultural content.
Findings
The best MLLM reaches 64.4% accuracy, below the human average of 78.2%.
Models perform worse on traditional Chinese culture images.
Incorporating emotional hints improves model accuracy.
Abstract
As the capabilities of Multimodal Large Language Models (MLLMs) continue to improve, the need for higher-order capability evaluation of MLLMs is increasing. However, there is a lack of work evaluating MLLM for higher-order perception and understanding of Chinese visual content. To fill the gap, we introduce the **C**hinese **I**mage **I**mplication understanding **Bench**mark, **CII-Bench**, which aims to assess the higher-order perception and understanding capabilities of MLLMs for Chinese images. CII-Bench stands out in several ways compared to existing benchmarks. Firstly, to ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, with corresponding answers also manually crafted. Additionally, CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings,…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. Understanding the Chinese image implication is an interesting and high-level capacity for MLLMs. 2. High quality of the dataset. 3. Sufficient analysis of experimental results.
1. The dataset is small. 2. Multi-choice evaluation may not reveal the real capacity to understand the implications.
- This paper is well-written and easy to read. - Authors evaluate the performance of many different MLLMs. Generally speaking, the experiments are extensive.
- The scale of this dataset is a little small. CII-Bench only contains 698 images and 800 questions, which may not be comprehensive enough to evaluate the performance of MLLMs. - Some detailed information about the dataset should be provided. For example, the ratio of six different types of images. - The motivation is not strong enough. I think this work is just an extension of II-Bench [1]. So, to demonstrate the necessity of this paper, the authors should discuss or conclude the inconsistencie
(1) This paper is the first benchmark work to propose Chinese image representation understanding, which is of some help to the multimodal large language model for understanding Chinese images. (2) The paper comprehensively compares the capability of existing multimodal large language models.
**Weakness 1** It is mentioned in the paper that “in order to ensure the authenticity of the Chinese context, the pictures in CII-Bench are all from the Chinese Internet and have been manually reviewed, and the corresponding answers are also manually produced.” So the pictures are all from the Internet, and most of them are not real pictures, which greatly limits the development of Chinese language, and it is suggested that Chinese pictures from some real scenarios should be added. **Weakness 2
- CII-Bench is intriguing, and its construction process is clearly presented, offering value for the development of image implication understanding. - Evaluations are conducted on multiple open- and closed-source MLLMs, providing detailed analyses of CII-Bench from various perspectives.
- The proposed CII-Bench includes a greater emphasis on understanding the cultural and emotional content behind images. In this context, did the authors design more complex prompts to better guide the model's output? For instance, did they use background information and Chain-of-Thought (CoT) prompting to help the MLLM predict answers from the background context? - The English images presented in Fig. 1 are not convincing, as there are also complex and suggestive English images. The authors sho
1. This dataset is constructed using a rigorous pipeline that includes repeated image filtering and consistency checks, ensuring its high quality. 2. The number of models used for evaluation is extensive, encompassing both open-source and proprietary options, and we can observe a significant performance gap between different models. 3. Compared to most previous works, the prompting strategies used for evaluation are quite exhaustive, making the results highly informative and instructive. 4. This
1. The size of this dataset—698 images and 800 questions—is quite small, which may render the conclusions drawn from the evaluation results non-generalizable. 2. Since all the questions in this benchmark are multiple-choice, the output obtained from MLLM may be biased, as some models tend to favor specific choices. Therefore, the authors are encouraged to use the ``CircularEval`` in MMBench[^1] to ensure more robust results. 3. As shown in Table 1, text-only models, such as Qwen2-7B-Instruct, ca
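The circular evaluation suggested in the last review can be sketched as follows. This is a minimal illustration of the general idea, not the MMBench implementation: each question is asked once per rotation of its options, and it counts as correct only if the model picks the right option every time. The function `circular_eval`, the `predict` callable, and the two toy models are hypothetical names introduced here for illustration.

```python
from typing import Callable, Sequence


def circular_eval(
    question: str,
    choices: Sequence[str],
    answer_index: int,
    predict: Callable[[str, Sequence[str]], int],
) -> bool:
    """Return True only if `predict` is right under every rotation of the options.

    `predict` is a hypothetical stand-in for a model call: it receives the
    question and the (rotated) options and returns the index it selects.
    """
    n = len(choices)
    for shift in range(n):
        # Rotate the options left by `shift` positions.
        rotated = list(choices[shift:]) + list(choices[:shift])
        # After rotation the correct option sits at a new index.
        target = (answer_index - shift) % n
        if predict(question, rotated) != target:
            return False
    return True


# Toy "models" (assumptions for the demo, not real MLLMs).
def always_first(question: str, options: Sequence[str]) -> int:
    """A position-biased model that always selects the first option."""
    return 0


def knows_answer(question: str, options: Sequence[str]) -> int:
    """A content-based model that finds the right option wherever it is."""
    return list(options).index("irony")


options = ["irony", "praise", "nostalgia", "warning"]
biased_passes = circular_eval("What does the image imply?", options, 0, always_first)
robust_passes = circular_eval("What does the image imply?", options, 0, knows_answer)
```

In this toy case the position-biased model would be scored correct by a single-pass evaluation, because the right answer happens to be listed first, but it fails once the options are rotated. That is the kind of choice bias the reviewer is concerned about.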
Taxonomy
Topics: Natural Language Processing Techniques · Digital Media Forensic Detection
Methods: Balanced Selection
