Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

Alaa Elsetohy; Sama Hadhoud; Haryo Akbarianto Wibowo; Chenxi Whitehouse; Genta Indra Winata; Fajri Koto; Alham Fikri Aji

arXiv:2602.10732·cs.CL·April 21, 2026

TL;DR

Macaron is a multilingual, multicultural reasoning benchmark using template-based questions to evaluate language models across diverse cultural contexts and reasoning types.

Contribution

The paper introduces a controlled, culturally grounded benchmark built from 100 templates that cover 7 reasoning types and 22 cultural aspects across 20 languages and dialects, enabling systematic evaluation of multilingual models.

Findings

01

Reasoning-mode models perform best, reaching 80.8% overall accuracy.

02

Performance is near parity between English and local-language questions.

03

Open-weight models struggle with local languages and True/False (T/F) questions.
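
Findings like these come down to grouped accuracy over per-question results. The sketch below shows one way such numbers could be computed. The record fields (`question_language`, `format`, `correct`) and the four toy records are hypothetical, chosen only for illustration; they are not the Macaron schema and not results from the paper.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {group: fraction correct} for records grouped by `key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        correct[record[key]] += int(record["correct"])
    return {group: correct[group] / totals[group] for group in totals}


# Toy per-question results (made up; NOT data from the paper).
records = [
    {"question_language": "english", "format": "mcq", "correct": True},
    {"question_language": "english", "format": "tf", "correct": True},
    {"question_language": "local", "format": "mcq", "correct": True},
    {"question_language": "local", "format": "tf", "correct": False},
]

by_language = accuracy_by(records, "question_language")
by_format = accuracy_by(records, "format")

# A gap near zero would correspond to "near parity" between question languages.
english_local_gap = by_language["english"] - by_language["local"]
```

Grouping by a single key keeps the helper reusable: the same function gives the English versus local-language comparison and the multiple-choice versus True/False comparison.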

Abstract

Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates that cover 7 reasoning types and 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions, along with systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages and dialects (including low-resource ones such as Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance (80.8% overall)…
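
The template-first idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Template` and `MCQ` structures, the `{slot}` syntax, the sample item, and especially `derive_true_false`, which assumes one True/False statement per answer option. The abstract does not describe the actual derivation procedure, so this should not be read as the authors' method.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """A language-agnostic question template (hypothetical structure)."""
    reasoning_type: str
    cultural_aspect: str
    question: str  # contains named slots such as {festival}


@dataclass(frozen=True)
class MCQ:
    question: str
    options: tuple[str, ...]
    answer_index: int


def fill(template: Template, slots: dict[str, str],
         options: tuple[str, ...], answer_index: int) -> MCQ:
    """Fill the template's slots with annotator-written, culture-specific text."""
    return MCQ(template.question.format(**slots), options, answer_index)


def derive_true_false(mcq: MCQ) -> list[tuple[str, bool]]:
    """Assumed derivation: one statement per option, True only for the answer."""
    return [
        (f"{mcq.question} Answer: {option}", index == mcq.answer_index)
        for index, option in enumerate(mcq.options)
    ]


# Illustrative item with generic slot values (not taken from the dataset).
template = Template(
    reasoning_type="temporal",
    cultural_aspect="festivals",
    question="{festival} falls on day {day} of a 30-day month. "
             "How many days remain in the month after it?",
)
example = fill(
    template,
    slots={"festival": "The harvest festival", "day": "12"},
    options=("18", "12", "20", "30"),
    answer_index=0,
)
tf_items = derive_true_false(example)
```

Because the reasoning type and cultural aspect live on the template rather than on the filled question, the same template can be filled once per country and language, which is what makes the comparison across question languages controlled.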

Code & Models

Repositories

https://huggingface.co/datasets/AlaaAhmed2444/Macaron

Datasets

AlaaAhmed2444/Macaron
dataset · 74 downloads
