TL;DR
This paper introduces X-Value, a comprehensive benchmark for evaluating large language models' ability to judge deep-level values across multiple languages, addressing a critical gap in multilingual AI evaluation.
Contribution
It presents a novel two-stage human-AI annotation framework and the first cross-lingual values judgment benchmark, X-Value, covering 14 languages and 7 global issue categories.
Findings
LLMs show limitations in cross-lingual values judgment accuracy.
Performance varies significantly across languages and issue categories.
Current models need improvement in values-aware content evaluation.
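As a rough illustration of how variation across languages and issue categories could be measured, the sketch below groups judgment records and computes accuracy per group. It is not the paper's evaluation code: the record fields (`language`, `category`, `gold`, `predicted`), the label strings, and the language and category values are all hypothetical placeholders.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {group: accuracy}, grouping records by records[i][key]."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        # A judgment counts as correct when the model's label matches the gold label.
        if r["predicted"] == r["gold"]:
            correct[r[key]] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical records; field names and values are illustrative only.
records = [
    {"language": "en", "category": "A", "gold": "appropriate", "predicted": "appropriate"},
    {"language": "en", "category": "B", "gold": "inappropriate", "predicted": "appropriate"},
    {"language": "zh", "category": "A", "gold": "inappropriate", "predicted": "inappropriate"},
    {"language": "zh", "category": "B", "gold": "appropriate", "predicted": "appropriate"},
]

by_language = accuracy_by(records, "language")
by_category = accuracy_by(records, "category")
```

Reporting the two breakdowns separately makes it easier to see whether a weak score comes from a particular language or from a particular issue category.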
Abstract
As large language models (LLMs) are employed worldwide, existing evaluation paradigms for their multilingual capabilities primarily focus on factual task performance, neglecting the ability to judge content's deep-level values across multiple languages. To bridge this gap, we first reveal two primary challenges in constructing values judgment benchmarks, cultural diversity and disciplinary complexity, and propose a novel two-stage human-AI collaborative annotation framework to alleviate them. This framework identifies the issue scope and nature, establishes specific annotation criteria, and utilizes multiple LLMs for final review. Building upon this framework, we introduce X-Value, the first Cross-lingual Values Judgment Benchmark designed to evaluate the capability of LLMs in judging deep-level values of content. X-Value comprises 4,750 Question-Answer pairs across 14…
