TL;DR
This paper introduces GAIA-v2-LILT, a multilingual extension of the GAIA agent benchmark, built with a refined adaptation workflow that makes cross-lingual performance measurement more accurate.
Contribution
It proposes a workflow for adapting English benchmarks into multiple languages with explicit functional alignment, cultural alignment, and difficulty calibration, combining automated checks with human review to reduce measurement error in multilingual agent evaluation.
Findings
The workflow improves agent success rates by up to 32.7% over minimally translated versions.
The closest audited setting comes within 3.1% of English performance.
Substantial gaps to English remain in many other cases.
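The findings above compare success rates across benchmark variants. A minimal sketch of that comparison follows, assuming per-task pass/fail outcomes and assuming the reported figures are percentage-point differences (the summary does not say whether they are absolute or relative). The function names and the toy data are hypothetical and are not results from the paper.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Return the share of solved tasks as a percentage."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def gaps_to_reference(results: Dict[str, List[bool]], reference: str) -> Dict[str, float]:
    """Percentage-point gap between the reference variant and every other variant."""
    ref_rate = success_rate(results[reference])
    return {
        name: ref_rate - success_rate(outcomes)
        for name, outcomes in results.items()
        if name != reference
    }


# Hypothetical toy outcomes for one agent on four tasks per variant.
toy_results = {
    "english": [True, True, True, False],
    "minimal_translation": [True, False, False, False],
    "audited": [True, True, False, False],
}

gaps = gaps_to_reference(toy_results, "english")
improvement = success_rate(toy_results["audited"]) - success_rate(
    toy_results["minimal_translation"]
)
```

Here `gaps` holds each variant's distance from the English score, and `improvement` is the gain of the audited variant over the minimally translated one, both in percentage points.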
Abstract
Agent benchmarks remain largely English-centric, while their multilingual versions are often built with machine translation (MT) and limited post-editing. We argue that, for agentic tasks, this minimal workflow can easily break benchmark validity through query-answer misalignment or culturally off-target context. We propose a refined workflow for adapting English benchmarks into multiple languages with explicit functional alignment, cultural alignment, and difficulty calibration using both automated checks and human review. Using this workflow, we introduce GAIA-v2-LILT, a re-audited multilingual extension of GAIA covering five non-English languages. In experiments, our workflow improves agent success rates by up to 32.7% over minimally translated versions, bringing the closest audited setting to within 3.1% of English performance while substantial gaps remain in many other cases. This…
