HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, Siqi He, Shannan Yan, Junzhe Chen, Xiaomin He, Chaoya Jiang, Wei Ye, Kaidong Yu, Xuelong Li

TL;DR
HSSBench is a new benchmark designed to evaluate multimodal large language models on humanities and social sciences tasks, emphasizing interdisciplinary reasoning and knowledge integration across multiple languages.
Contribution
The paper introduces HSSBench, a comprehensive HSS-focused benchmark with a novel data generation pipeline, addressing a gap in existing MLLM evaluation methods.
Findings
Current MLLMs struggle with HSS tasks
HSSBench contains over 13,000 samples across six categories
Benchmarking reveals significant challenges for state-of-the-art models
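A benchmark of this shape (multiple-choice samples grouped into categories) is usually scored as per-category accuracy. The following is a minimal sketch of that scoring, not the authors' evaluation code: the record fields `category`, `answer`, and `prediction`, the function name, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return {category: accuracy} for multiple-choice records.

    Each record is assumed (hypothetically) to be a dict with the keys
    'category', 'answer' (gold option letter), and 'prediction'
    (the model's option letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        gold = record["answer"].strip().upper()
        predicted = record["prediction"].strip().upper()
        if predicted == gold:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy data for illustration only; these are not HSSBench samples.
toy_records = [
    {"category": "Art", "answer": "A", "prediction": "a"},
    {"category": "Art", "answer": "C", "prediction": "B"},
    {"category": "Other", "answer": "D", "prediction": "D "},
]

scores = per_category_accuracy(toy_records)
for category, accuracy in sorted(scores.items()):
    print(f"{category}: {accuracy:.2%}")
```

Reporting accuracy per category rather than one pooled number keeps a weak category from being hidden by stronger ones.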
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for…
Peer Reviews
Decision: ICLR 2026 Poster
**[+] Comprehensive and Challenging Benchmark:** The paper presents a valuable evaluation resource with strong empirical validation. The experimental results (particularly in Appendix C.5) convincingly demonstrate that HSSBench captures unique challenges in HSS domains. For instance, InternVL3-8B achieves 77.21% on MME and 68.10% on MMMU but only 42.14% on HSSBench-Art, showing that the benchmark effectively tests capabilities that existing benchmarks may not fully capture.
**[+] Unprecedented
**[-] Positioning and Novelty Claims Could Be More Precise:** I observed that the paper states existing benchmarks have been "overlooking" HSS domains and positions HSSBench as "addressing this gap." However, I note that established benchmarks like MMMU and CMMMU already include "Humanities & Social Science" and "Art & Design" as core evaluation categories. I believe the paper's true contribution lies in providing deeper granularity (45 subtypes vs. broader categories) and expanded multilingual
* **Addresses a Clear Gap:** The paper convincingly argues for the need for a benchmark beyond STEM, focusing on the challenging and underserved HSS domain.
* **High-Quality Data Pipeline:** The 3-stage VGP, combining domain experts and agents, is a robust methodology. The validation step to ensure true multimodality (checking text-image dependencies) is a key strength.
* **Insightful Analysis:** The findings are valuable, particularly that CoT can *increase* hallucinations on HSS tasks and tha
* **Format Mismatch:** The paper's stated goal is to test "horizontal reasoning" (implying divergent thought and multiple interpretations), but the benchmark uses an MCQ format, which enforces a single correct "vertical" answer.
* **Potential Cultural Bias:** The authors disclose that "most of the data experts... are Chinese". This poses a significant risk of cultural bias in a global benchmark, a limitation the authors concede may have skewed the results (e.g., Qwen outperforming GPT-4o in some
- The work compellingly argues for and addresses the lack of dedicated, in-depth benchmarks for HSS domains, which require different reasoning skills than typical STEM tasks (Section 1).
- The paper proposes a sophisticated VQA Generation Pipeline (VGP) that leverages both domain experts and a multi-agent framework (Figure 3, Section 2). The multi-stage validation process (Section 2.3) ensures data quality and that questions are truly multimodal.
- The paper provides useful qualitative analyses,
- The group of human experts appears to have a strong majority of Chinese speakers (Table 2, Appendix A.2). While the authors acknowledge this may benefit certain models like Qwen (lines 362-365) and describe mitigation efforts (lines 818-827), this demographic skew could introduce subtle cultural biases into the dataset's content and framing, despite best efforts.
- The multilingual aspect of the benchmark is created by translating an original set of questions using LLMs, followed by expert va
Taxonomy
Topics: Multimodal Machine Learning Applications · Computational and Text Analysis Methods · Topic Modeling
