mSCoRe: a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning
Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen

TL;DR
mSCoRe is a multilingual benchmark for evaluating and analyzing the reasoning skills of large language models on complex, culturally diverse commonsense reasoning tasks. It exposes limitations of current models on these tasks.
Contribution
This paper introduces mSCoRe, a novel scalable benchmark with a reasoning skill taxonomy, data synthesis pipeline, and complexity framework for multilingual commonsense reasoning evaluation.
Findings
Current LLMs find mSCoRe significantly challenging, especially at higher complexity levels.
Models show limitations in handling nuanced multilingual and cultural commonsense reasoning.
Detailed analysis suggests directions for future improvements in reasoning capabilities.
Abstract
Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their utilization of different human reasoning skills remains poorly investigated, especially for multilingual commonsense reasoning that involves everyday knowledge across different languages and cultures. To address this gap, we propose a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning (mSCoRe). Our benchmark incorporates three key components designed to systematically evaluate LLMs' reasoning capabilities: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models' reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity…
