MPR-GUI: Benchmarking and Enhancing Multilingual Perception and Reasoning in GUI Agents

Ruihan Chen; Qiming Li; Xiaocheng Feng; Weihong Zhong; Xiaoliang Yang; Yuxuan Gu; Zekun Zhou; Yunfei Lu; Haoyu Ren; Kun Chen; Dandan Tu; Bing Qin

arXiv:2512.00756 · cs.AI · April 29, 2026


TL;DR

This paper introduces MPR-GUI-Bench, a benchmark for multilingual GUI perception and reasoning, and proposes GUI-XLI, a method to improve cross-lingual performance by aligning hidden states across languages.

Contribution

The paper presents a new multilingual benchmark with fine-grained diagnostics and an intervention method to enhance cross-lingual GUI agent capabilities.

Findings

01. The benchmark identifies consistent perception and reasoning gaps between English and non-English settings.

02. The proposed GUI-XLI method reduces cross-lingual performance gaps by an average of 6.5%.

03. The benchmark shows that reasoning-intensive tasks are particularly challenging in non-English languages.
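
To make the notion of a cross-lingual gap concrete, here is a minimal sketch of how such a gap could be computed from per-task accuracies in a strictly aligned setting. The function name, the task labels, the language code "xx", and all numbers are hypothetical placeholders, not the paper's metric definition or its results; the paper may define the gap differently.

```python
from statistics import mean


def cross_lingual_gaps(accuracy, reference="en"):
    """Return, per language, the mean accuracy gap to the reference language.

    `accuracy` maps language code -> {task name -> accuracy in [0, 1]}.
    This is an illustrative definition, assumed for this sketch only.
    """
    ref = accuracy[reference]
    gaps = {}
    for lang, per_task in accuracy.items():
        if lang == reference:
            continue
        # Compare only tasks scored in both languages, mirroring the idea
        # of evaluating on aligned items.
        shared = sorted(set(ref) & set(per_task))
        gaps[lang] = mean(ref[task] - per_task[task] for task in shared)
    return gaps


# Hypothetical placeholder numbers, NOT results from the paper.
toy = {
    "en": {"perception": 0.80, "reasoning": 0.70},
    "xx": {"perception": 0.75, "reasoning": 0.55},
}
gaps = cross_lingual_gaps(toy)
```

A positive value means the model scores lower in that language than in the reference language on the shared tasks; breaking the same difference out per task would show whether perception or reasoning contributes more to the gap.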

Abstract

Large Vision-Language Models (LVLMs) have shown strong potential as multilingual Graphical User Interface (GUI) agents, as evidenced by existing GUI benchmarks. However, these benchmarks exhibit two primary limitations: (1) although Perception and Reasoning (P&R) capabilities are fundamental for GUI agents, current benchmarks lack fine-grained diagnostics to identify which specific capabilities lead to task failures, hindering targeted improvements; (2) existing benchmarks fail to provide a strictly aligned cross-lingual evaluation environment, introducing confounding factors that prevent isolating the language impact on GUI agent performance. To address these issues, we propose the Multilingual P&R GUI Benchmark (MPR-GUI-Bench), featuring strictly aligned environments across six languages and eight fine-grained P&R tasks. Our benchmark reveals consistent P&R gaps between English and…


Code & Models

Datasets

chenruihan/MPR-GUI-Bench (dataset, 349 downloads)
