mSCoRe: A Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning

Nghia Trung Ngo; Franck Dernoncourt; Thien Huu Nguyen

arXiv:2508.10137·cs.CL·August 15, 2025

TL;DR

mSCoRe is a multilingual benchmark for evaluating and analyzing the reasoning skills of large language models on complex, culturally diverse commonsense reasoning tasks. It highlights the limitations of current models.

Contribution

This paper introduces mSCoRe, a novel scalable benchmark with a reasoning skill taxonomy, data synthesis pipeline, and complexity framework for multilingual commonsense reasoning evaluation.

Findings

1. Current LLMs find mSCoRe significantly challenging, especially at higher complexity levels.

2. Models show limitations in handling nuanced multilingual and cultural commonsense reasoning.

3. Detailed analysis suggests directions for future improvements in reasoning capabilities.

Abstract

Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their utilization of different human reasoning skills remains poorly investigated, especially for multilingual commonsense reasoning, which involves everyday knowledge across different languages and cultures. To address this gap, we propose a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning (mSCoRe). Our benchmark incorporates three key components designed to systematically evaluate LLMs' reasoning capabilities: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models' reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity…
