HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning

Xiaoyuan Li; Moxin Li; Rui Men; Yichang Zhang; Keqin Bao; Wenjie Wang; Fuli Feng; Dayiheng Liu; Junyang Lin

arXiv:2502.11393 · cs.CL · May 27, 2025

TL;DR

HellaSwag-Pro is a large bilingual benchmark designed to evaluate the robustness of large language models in commonsense reasoning, revealing significant vulnerabilities and language-dependent variations.

Contribution

This paper introduces HellaSwag-Pro, the first extensive bilingual benchmark for assessing LLM robustness in commonsense reasoning, together with Chinese HellaSwag, a new 12,000-instance Chinese dataset, and experiments on 41 LLMs.

Findings

01. LLMs are not robust in commonsense reasoning

02. Robustness varies across languages

03. Benchmark provides valuable insights for future research

Abstract

Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning; however, some variations in questions can trigger incorrect responses. Do these models truly understand commonsense knowledge, or just memorize expression patterns? To investigate this question, we present the first extensive robustness evaluation of LLMs in commonsense reasoning. We introduce HellaSwag-Pro, a large-scale bilingual benchmark consisting of 11,200 cases, by designing and compiling seven types of question variants. To construct this benchmark, we propose a two-stage method to develop Chinese HellaSwag, a finely annotated dataset comprising 12,000 instances across 56 categories. We conduct extensive experiments on 41 representative LLMs, revealing that these LLMs are far from robust in commonsense reasoning. Furthermore, this robustness varies depending on the language in which the LLM…

Taxonomy

Topics: Natural Language Processing Techniques