UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat
Omer Nacar

TL;DR
This paper evaluates the Arabic-centric ALLaM-34B language model through UI-level testing of HUMAIN Chat. It reports strong performance across dialects, reasoning, and safety, and argues that the model is ready for practical deployment.
Contribution
It provides a comprehensive UI-level evaluation of ALLaM-34B, including new insights into dialect fidelity, reasoning, and safety performance in Arabic language tasks.
Findings
High performance in generation and code-switching tasks (average 4.92/5)
Strong results in Modern Standard Arabic handling (4.74/5)
Reliable safety performance (4.54/5)
Abstract
Large language models (LLMs) trained primarily on English corpora often struggle to capture the linguistic and cultural nuances of Arabic. To address this gap, the Saudi Data and AI Authority (SDAIA) introduced the ALLaM family of Arabic-focused models. The most capable of these available to the public, ALLaM-34B, was subsequently adopted by HUMAIN, who developed and deployed HUMAIN Chat, a closed conversational web service built on this model. This paper presents an expanded and refined UI-level evaluation of ALLaM-34B. Using a prompt pack spanning Modern Standard Arabic, five regional dialects, code-switching, factual knowledge, arithmetic and temporal reasoning, creative generation, and adversarial safety, we collected 115 outputs (23 prompts × 5 runs) and scored each with three frontier LLM judges (GPT-5, Gemini 2.5 Pro, Claude Sonnet-4). We compute category-level means…
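The scoring setup described in the abstract (23 prompts, 5 runs each, three judges per output, category-level means) can be sketched as follows. This is a minimal illustration and not the paper's code: the record layout, the function name `category_means`, and the choice to average the three judges per output before averaging within a category are all assumptions, and the sample scores are made up for the demonstration.

```python
from collections import defaultdict
from statistics import mean

# Counts taken from the abstract.
N_PROMPTS, N_RUNS, N_JUDGES = 23, 5, 3
N_OUTPUTS = N_PROMPTS * N_RUNS          # 115 collected outputs
N_JUDGMENTS = N_OUTPUTS * N_JUDGES      # one score per judge per output


def category_means(records):
    """Aggregate judge scores into one mean per category.

    `records` is an iterable of (category, prompt_id, run, judge, score)
    tuples. This layout is hypothetical; the paper does not specify one.
    Assumed aggregation: average the judges for each output first, then
    average those per-output scores within each category.
    """
    per_output = defaultdict(list)
    for category, prompt_id, run, _judge, score in records:
        per_output[(category, prompt_id, run)].append(score)

    per_category = defaultdict(list)
    for (category, _prompt_id, _run), scores in per_output.items():
        per_category[category].append(mean(scores))

    return {cat: round(mean(vals), 2) for cat, vals in per_category.items()}


# Illustrative scores only; these are NOT results from the paper.
sample = [
    ("msa", "p1", 1, "judge_a", 5), ("msa", "p1", 1, "judge_b", 4),
    ("msa", "p1", 1, "judge_c", 5),
    ("msa", "p1", 2, "judge_a", 5), ("msa", "p1", 2, "judge_b", 5),
    ("msa", "p1", 2, "judge_c", 5),
    ("safety", "p2", 1, "judge_a", 4), ("safety", "p2", 1, "judge_b", 4),
    ("safety", "p2", 1, "judge_c", 5),
]
result = category_means(sample)
print(N_OUTPUTS, N_JUDGMENTS, result)
```

Averaging judges per output before averaging per category gives each output equal weight even if a judge score is missing for some output; a flat mean over all judge scores would be an equally plausible reading of "category-level means".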
