HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

Zhaolu Kang; Junhao Gong; Jiaxu Yan; Wanke Xia; Yian Wang; Ziwen Wang; Huaxuan Ding; Zhuo Cheng; Wenhao Cao; Zhiyuan Feng; Siqi He; Shannan Yan; Junzhe Chen; Xiaomin He; Chaoya Jiang; Wei Ye; Kaidong Yu; Xuelong Li

arXiv:2506.03922·cs.CL·March 4, 2026

TL;DR

HSSBench is a new benchmark designed to evaluate multimodal large language models on humanities and social sciences tasks, emphasizing interdisciplinary reasoning and knowledge integration across multiple languages.

Contribution

The paper introduces HSSBench, a comprehensive HSS-focused benchmark with a novel data generation pipeline, addressing a gap in existing MLLM evaluation methods.

Findings

01. Current MLLMs struggle with HSS tasks

02. HSSBench contains over 13,000 samples across six categories

03. Benchmarking reveals significant challenges for state-of-the-art models

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

**[+] Comprehensive and Challenging Benchmark:** The paper presents a valuable evaluation resource with strong empirical validation. The experimental results (particularly in Appendix C.5) convincingly demonstrate that HSSBench captures unique challenges in HSS domains. For instance, InternVL3-8B achieves 77.21% on MME and 68.10% on MMMU but only 42.14% on HSSBench-Art, showing that the benchmark effectively tests capabilities that existing benchmarks may not fully capture.

**[+] Unprecedented

Weaknesses

**[-] Positioning and Novelty Claims Could Be More Precise:** I observed that the paper states existing benchmarks have been "overlooking" HSS domains and positions HSSBench as "addressing this gap." However, I note that established benchmarks like MMMU and CMMMU already include "Humanities & Social Science" and "Art & Design" as core evaluation categories. I believe the paper's true contribution lies in providing deeper granularity (45 subtypes vs. broader categories) and expanded multilingual

Reviewer 02 · Rating 6 · Confidence 3

Strengths

* **Addresses a Clear Gap:** The paper convincingly argues for the need for a benchmark beyond STEM, focusing on the challenging and underserved HSS domain.
* **High-Quality Data Pipeline:** The 3-stage VGP, combining domain experts and agents, is a robust methodology. The validation step to ensure true multimodality (checking text-image dependencies) is a key strength.
* **Insightful Analysis:** The findings are valuable, particularly that CoT can *increase* hallucinations on HSS tasks and tha

Weaknesses

* **Format Mismatch:** The paper's stated goal is to test "horizontal reasoning" (implying divergent thought and multiple interpretations), but the benchmark uses an MCQ format, which enforces a single correct "vertical" answer.
* **Potential Cultural Bias:** The authors disclose that "most of the data experts... are Chinese". This poses a significant risk of cultural bias in a global benchmark, a limitation the authors concede may have skewed the results (e.g., Qwen outperforming GPT-4o in some

Reviewer 03 · Rating 8 · Confidence 2

Strengths

- The work compellingly argues for and addresses the lack of dedicated, in-depth benchmarks for HSS domains, which require different reasoning skills than typical STEM tasks (Section 1).
- The paper proposes a sophisticated VQA Generation Pipeline (VGP) that leverages both domain experts and a multi-agent framework (Figure 3, Section 2). The multi-stage validation process (Section 2.3) ensures data quality and that questions are truly multimodal.
- The paper provides useful qualitative analyses,

Weaknesses

- The group of human experts appears to have a strong majority of Chinese speakers (Table 2, Appendix A.2). While the authors acknowledge this may benefit certain models like Qwen (lines 362-365) and describe mitigation efforts (lines 818-827), this demographic skew could introduce subtle cultural biases into the dataset's content and framing, despite best efforts.
- The multilingual aspect of the benchmark is created by translating an original set of questions using LLMs, followed by expert va

Code & Models

Repositories

Zhaolu-K/HSSBench · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Computational and Text Analysis Methods · Topic Modeling