Arabic Prompts with English Tools: A Benchmark
Konstantin Kubrak, Ahmed El-Moselhy, Ammar Alsulami, Remaz Altuwaim, Hassan Ismail Fawaz, Faisal Alsaby

TL;DR
This paper introduces the first dedicated benchmark for evaluating the tool-calling and agentic capabilities of LLMs prompted in Arabic. It reveals a 5-10% drop in tool-calling accuracy in Arabic interactions and aims to improve AI reliability for Arabic speakers.
Contribution
It provides a standardized framework for measuring the functional accuracy and robustness of LLMs in Arabic tool-use and agentic workflows, addressing a critical gap in multilingual AI evaluation.
Findings
Tool-calling accuracy drops by 5-10% in Arabic interactions.
Models perform worse on Arabic prompts regardless of tool language.
The benchmark highlights the need for more linguistically equitable AI development.
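The comparison behind these findings can be pictured as grouping tool-calling results by prompt language and tool language, then computing accuracy per group. The sketch below is only an illustration of that idea, not the paper's evaluation code: the `ToolCall` type, the exact-match criterion, and the record fields `prompt_lang`, `tool_lang`, `predicted`, and `expected` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ToolCall:
    """Hypothetical representation of one tool call (not from the paper)."""
    name: str
    arguments: dict


def calls_match(predicted: ToolCall, expected: ToolCall) -> bool:
    # Assumed criterion: exact match on tool name and on all arguments.
    # The benchmark itself may score calls differently.
    return predicted.name == expected.name and predicted.arguments == expected.arguments


def accuracy(pairs: Iterable[tuple]) -> float:
    """Fraction of (predicted, expected) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("cannot compute accuracy over zero examples")
    return sum(calls_match(p, e) for p, e in pairs) / len(pairs)


def accuracy_by_condition(records: Iterable[dict]) -> dict:
    """Group records by (prompt_lang, tool_lang) and score each group."""
    grouped = {}
    for record in records:
        key = (record["prompt_lang"], record["tool_lang"])
        grouped.setdefault(key, []).append((record["predicted"], record["expected"]))
    return {key: accuracy(pairs) for key, pairs in grouped.items()}
```

Under these assumptions, comparing the score for `("ar", "en")` against `("en", "en")` would give the kind of accuracy gap the findings describe. Exact matching is the strictest choice; a real harness might normalize argument values before comparing.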
Abstract
Large Language Models (LLMs) are now integral to numerous industries, increasingly serving as the core reasoning engine for autonomous agents that perform complex tasks through tool-use. While the development of Arabic-native LLMs is accelerating, the benchmarks for evaluating their capabilities lag behind, with most existing frameworks focusing on English. A critical and overlooked area is tool-calling, where the performance of models prompted in non-English languages like Arabic is poorly understood, especially since these models are often pretrained on predominantly English data. This paper addresses this critical gap by introducing the first dedicated benchmark for evaluating the tool-calling and agentic capabilities of LLMs in the Arabic language. Our work provides a standardized framework to measure the functional accuracy and robustness of models in Arabic agentic workflows. Our…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
