Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering

Bowen Jiang; Runchuan Zhu; Jiang Wu; Zinco Jiang; Yifan He; Junyuan Gao; Jia Yu; Rui Min; Yinfan Wang; Haote Yang; Songyang Zhang; Dahua Lin; Lijun Wu; Conghui He

arXiv:2505.16591·cs.CL·May 23, 2025

TL;DR

KoLasSimpleQA is a comprehensive multilingual benchmark designed to evaluate the factual knowledge and self-awareness of Large Language Models across nine languages and two domains, highlighting performance gaps and guiding future improvements.

Contribution

This paper introduces KoLasSimpleQA, which the authors describe as the first benchmark for the multilingual factual ability of LLMs. It covers nine languages and two domains (general and language-specific).

Findings

01 Model performance differs significantly between the general and language-specific domains.

02 Mainstream LLMs vary in accuracy and robustness across languages.

03 The benchmark supports targeted evaluation and model optimization in multilingual settings.

Abstract

We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. These questions enable efficient evaluation using the LLM-as-judge paradigm, testing both the LLMs' factual memory and self-awareness ("know what they don't know"). KoLasSimpleQA expands existing research in two key dimensions: (1) Breadth (Multilingual Coverage): It includes 9 languages, supporting global applicability evaluation. (2) Depth (Dual Domain Design): It covers both the general domain (global facts) and the language-specific domain (such as history, culture, and regional traditions) for a comprehensive assessment of multilingual capabilities. We evaluated mainstream LLMs,…
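The LLM-as-judge evaluation described in the abstract could be aggregated roughly as follows. This is a minimal sketch, not the authors' implementation: the three judge labels (`CORRECT`, `INCORRECT`, `NOT_ATTEMPTED`), the metric names, the `Graded` record, and the `summarize` helper are all assumptions made for illustration. The idea is that a declined answer is counted separately from a wrong one, so factual memory and self-awareness can be read off different rates for each language and domain.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple

# Hypothetical judge labels (an assumption; the abstract does not list them).
LABELS = ("CORRECT", "INCORRECT", "NOT_ATTEMPTED")


class Graded(NamedTuple):
    """One judged answer. Field names are illustrative, not from the paper."""
    language: str  # placeholder language tag, e.g. "lang_a"
    domain: str    # "general" or "language_specific"
    label: str     # one of LABELS, as returned by the judge model


def summarize(records: Iterable[Graded]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Aggregate judge labels into per-(language, domain) rates."""
    counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for r in records:
        if r.label not in LABELS:
            raise ValueError(f"unknown judge label: {r.label!r}")
        counts[(r.language, r.domain)][r.label] += 1

    summary: Dict[Tuple[str, str], Dict[str, float]] = {}
    for key, c in counts.items():
        total = sum(c.values())
        attempted = c["CORRECT"] + c["INCORRECT"]
        summary[key] = {
            # Share of all questions answered correctly (factual memory).
            "correct_rate": c["CORRECT"] / total,
            # Share of questions the model declined to answer.
            "not_attempted_rate": c["NOT_ATTEMPTED"] / total,
            # Accuracy on attempted questions only; a rough proxy for
            # whether the model "knows what it doesn't know".
            "correct_given_attempted": c["CORRECT"] / attempted if attempted else 0.0,
        }
    return summary


if __name__ == "__main__":
    # Toy records with placeholder tags; these are not benchmark results.
    demo = [
        Graded("lang_a", "general", "CORRECT"),
        Graded("lang_a", "general", "CORRECT"),
        Graded("lang_a", "general", "INCORRECT"),
        Graded("lang_a", "general", "NOT_ATTEMPTED"),
        Graded("lang_a", "language_specific", "NOT_ATTEMPTED"),
    ]
    for key, metrics in summarize(demo).items():
        print(key, metrics)
```

Grouping by both language and domain mirrors the breadth and depth dimensions in the abstract, so gaps between the general and language-specific domains appear as differences between rows for the same language.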

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training