CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution
Ruiyang Xu, Jialun Cao, Yaojie Lu, Ming Wen, Hongyu Lin, Xianpei Han, Ben He, Shing-Chi Cheung, Le Sun

TL;DR
CRUXEval-X is a comprehensive multilingual code reasoning benchmark covering 19 programming languages, designed to evaluate LLMs' capabilities beyond Python, with an automated construction pipeline and insights into cross-language model performance.
Contribution
It introduces CRUXEval-X, a multilingual code reasoning benchmark spanning 19 languages and built by a fully automated pipeline, addressing the language and task biases of existing benchmarks.
Findings
TypeScript and JavaScript show strong positive correlation in LLM performance.
Models trained only on Python achieve limited accuracy in other languages.
Cross-language generalization of LLMs is limited: a model trained solely on Python reaches at most 34.4% Pass@1 in other languages.
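The two quantities behind these findings, Pass@1 and the correlation between languages, can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes one sampled answer per problem (so Pass@1 is simply the fraction of problems answered correctly) and uses Pearson correlation as a stand-in for whatever correlation measure the authors report. The function names and all numbers are hypothetical.

```python
from math import sqrt


def pass_at_1(results):
    """Fraction of problems whose single sampled answer was correct.

    `results` is a list of booleans, one per problem (assumed setup).
    """
    return sum(results) / len(results)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up correctness flags for one model on four problems.
toy_results = [True, True, False, True]
toy_score = pass_at_1(toy_results)

# Made-up per-model scores in two languages; a value near 1.0 would
# indicate that models rank similarly in both languages.
toy_lang_a = [0.30, 0.45, 0.60]
toy_lang_b = [0.28, 0.47, 0.58]
toy_corr = pearson(toy_lang_a, toy_lang_b)
```

Computing the correlation over per-model scores, as done here, is one plausible reading of the TypeScript/JavaScript finding; the summary above does not say which unit the authors correlate over.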
Abstract
Code benchmarks such as HumanEval are widely adopted to evaluate Large Language Models' (LLMs) coding capabilities. However, existing code benchmarks carry a non-negligible programming language bias: over 95% of code generation benchmarks are dominated by Python, leaving LLMs' capabilities in other programming languages such as Java and C/C++ unknown. Coding task bias matters as well. Most benchmarks focus on code generation, while benchmarks for code reasoning (predicting the output for a given input, and predicting an input for a given output), an essential coding capability, are insufficient. Yet constructing multi-lingual benchmarks can be expensive and labor-intensive, and code from contest websites such as LeetCode suffers from data contamination during training. To fill this gap, we propose CRUXEVAL-X, a multi-lingual code reasoning benchmark that contains 19 programming…
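The two code reasoning tasks named in the abstract can be illustrated with a small sketch. The function `f` and both checker helpers below are hypothetical and not taken from the benchmark. The sketch assumes that a prediction is judged by executing the function, so that in the input-prediction task any input producing the target output counts as correct.

```python
def f(text: str) -> str:
    """Toy function standing in for a benchmark item (hypothetical)."""
    # Uppercase the characters at even positions, keep the rest unchanged.
    return "".join(ch.upper() if i % 2 == 0 else ch for i, ch in enumerate(text))


def check_output_prediction(func, given_input, predicted_output) -> bool:
    """Output prediction: the model sees the code and an input, and predicts the output."""
    return func(given_input) == predicted_output


def check_input_prediction(func, predicted_input, given_output) -> bool:
    """Input prediction: the model sees the code and an output, and proposes an input.

    Assumed rule: any input that reproduces the output is accepted.
    """
    return func(predicted_input) == given_output


# Output prediction: a correct and an incorrect model answer.
correct_output = check_output_prediction(f, "abcd", "AbCd")
wrong_output = check_output_prediction(f, "abcd", "ABCD")

# Input prediction: two different inputs both reproduce "AbCd".
first_input_ok = check_input_prediction(f, "abcd", "AbCd")
second_input_ok = check_input_prediction(f, "AbCd", "AbCd")
```

Checking by execution rather than by string comparison against a reference input is what lets several distinct inputs be accepted for the same output.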
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Focus
