MAPS: A Multilingual Benchmark for Agent Performance and Security

Omer Hofman; Jonathan Brokman; Oren Rachmil; Shamik Bose; Vikas Pahuja; Toshiya Shimizu; Trisha Starostina; Kelly Marchisio; Seraphina Goldfarb-Tarrant; Roman Vainshtein

arXiv:2505.15935 · cs.DB · February 11, 2026

TL;DR

MAPS is a comprehensive multilingual benchmark suite for evaluating agentic AI systems' performance and security across diverse languages and tasks, revealing performance degradation in non-English languages.

Contribution

This work introduces MAPS, the first standardized, multi-domain, security-aware benchmark suite for assessing multilingual agentic AI systems, filling a critical evaluation gap.

Findings

01. Performance drops in non-English languages

02. Security vulnerabilities increase outside English

03. Degradation varies by task and translation quality

Abstract

Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly advanced in capability and scope. Yet, since LLMs have been shown to struggle in multilingual settings, typically resulting in lower performance and reduced safety, agentic systems risk inheriting these limitations. This raises concerns about the accessibility of such systems, as users interacting in languages other than English may encounter unreliable or security-critical agent behavior. Despite growing interest in evaluating agentic AI and recent initial efforts toward multilingual interaction, existing benchmarks do not yet provide a comprehensive, multi-domain, security-aware evaluation of multilingual agentic systems. To address this gap, we propose MAPS, a multilingual benchmark suite designed to evaluate agentic AI systems across diverse languages and tasks. MAPS…

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning