mCSQA: Multilingual Commonsense Reasoning Dataset with Unified Creation Strategy by Language Models and Humans

Yusuke Sakai; Hidetaka Kamigaito; Taro Watanabe

arXiv:2406.04215·cs.CL·June 7, 2024·2 cites

TL;DR

This paper introduces mCSQA, a multilingual commonsense reasoning dataset created efficiently using language models and humans, to evaluate and improve multilingual language understanding and transfer capabilities.

Contribution

It presents a novel, cost-effective method for constructing multilingual commonsense datasets leveraging language models, highlighting the importance of language-specific data for evaluation.

Findings

1. High transferability for easy questions in multilingual LMs
2. Lower transfer for questions requiring deep knowledge
3. Multilingual LMs can generate language-specific QA data

Abstract

Curating a dataset of language-specific knowledge and common sense for evaluating the natural language understanding capabilities of language models is very challenging. Because annotators are in limited supply, most current multilingual datasets are created through translation, which cannot evaluate such language-specific aspects. Therefore, we propose Multilingual CommonsenseQA (mCSQA), which follows the construction process of CSQA but leverages language models for more efficient construction, e.g., by asking LMs to generate questions and answers, refine answers, and verify QAs, followed by reduced human effort for verification. The constructed dataset is a benchmark for the cross-lingual language-transfer capabilities of multilingual LMs, and experimental results showed high language-transfer capabilities for questions that LMs could easily solve, but lower transfer…
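
The generate, refine, verify sequence described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `QA` record, the `build_items` helper, and the toy stand-in functions are all hypothetical, and the rule that only items rejected by the LM verifier are routed to human annotators is an assumption made here to show how human effort could be reduced.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QA:
    """Hypothetical record for one multiple-choice question."""
    question: str
    choices: List[str]
    answer: str
    needs_human_check: bool = False


def build_items(
    concepts: List[str],
    generate: Callable[[str], QA],
    refine: Callable[[QA], QA],
    verify: Callable[[QA], bool],
) -> List[QA]:
    """Run generate -> refine -> verify for each concept.

    Assumption: an item goes to human review only when the
    automatic verifier does not accept it.
    """
    items = []
    for concept in concepts:
        qa = refine(generate(concept))
        qa.needs_human_check = not verify(qa)
        items.append(qa)
    return items


# Toy stand-ins for LM calls (illustrative only; a real pipeline
# would prompt a language model at each of these steps).
TOY_DISTRACTORS: Dict[str, List[str]] = {
    "river": ["desert", "river"],
    "sushi": ["sushi"],
}


def toy_generate(concept: str) -> QA:
    choices = [concept] + TOY_DISTRACTORS.get(concept, [])
    return QA(question=f"Question about {concept}?", choices=choices, answer=concept)


def toy_refine(qa: QA) -> QA:
    # Drop duplicate choices while keeping their original order.
    unique = list(dict.fromkeys(qa.choices))
    return QA(qa.question, unique, qa.answer)


def toy_verify(qa: QA) -> bool:
    # Accept only items that still have the answer plus a distractor.
    return qa.answer in qa.choices and len(qa.choices) >= 2


items = build_items(["river", "sushi"], toy_generate, toy_refine, toy_verify)
for qa in items:
    print(qa.question, qa.choices, "human check:", qa.needs_human_check)
```

Passing the three steps in as callables keeps the sketch independent of any particular model or API; each toy function could be swapped for a prompt to a language model without changing `build_items`.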

Code & Models

Datasets

yusuke1997/mCSQA
dataset · 647 dl

Videos

mCSQA: Multilingual Commonsense Reasoning Dataset with Unified Creation Strategy by Language Models and Humans · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling