SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures

Nedjma Ousidhoum; Junho Myung; Carla Perez-Almendros; Jiho Jin; Amr Keleg; Meriem Beloucif; Yi Zhou; Rodrigo Agerri; Vladimir Araujo; Naomi Baes; James Barry; Joanne Boisson; Nancy F. Chen; Christine de Kock; Aleksandra Edwards; Joseba Fernandez de Landa; Mohamed Fazli Imam; Huda Hakami; Shu-Kai Hsieh; Joseph Marvin Imperial; Roy Ka-Wei Lee; Zhengyuan Liu; Chenyang Lyu; Younes Samih; Johan Sjons; Bryan Tan; Asahi Ushio; Weihua Zheng; Alice Oh; Jose Camacho-Collados

arXiv:2605.02601·cs.CL·May 5, 2026

TL;DR

This paper describes a shared task that evaluates how well LLMs and other NLP systems handle everyday knowledge across more than 30 language-culture pairs, most of them involving low-resource languages, using an extended version of the BLEnD benchmark. It also reports on system performance and open challenges.

Contribution

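It presents an extended version of the manually constructed multilingual, multicultural BLEnD benchmark and an evaluation-only framework with two tracks, Short-Answer Questions (SAQ) and Multiple-Choice Questions (MCQ). Participants could not use the task data for training, fine-tuning, or few-shot learning. The paper also highlights the diverse approaches taken and the challenges encountered.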

Findings

01. Over 140 participants registered, with 62 teams submitting results.

02. Best systems demonstrated varied strategies for low-resource language evaluation.

03. Insights discussed include evaluation challenges and model behavior in under-represented cultures.

Abstract

We present our shared task on evaluating the adaptability of LLMs and NLP systems across multiple languages and cultures. The task data consist of an extended version of our manually constructed BLEnD benchmark (Myung et al. 2024), covering more than 30 language-culture pairs, predominantly representing low-resource languages spoken across multiple continents. As the task is designed strictly for evaluation, participants were not permitted to use the data for training, fine-tuning, few-shot learning, or any other form of model modification. Our task includes two tracks: (a) Short-Answer Questions (SAQ) and (b) Multiple-Choice Questions (MCQ). Participants were required to predict labels and were allowed to submit any NLP system and adopt diverse modelling strategies, provided that the benchmark was used solely for evaluation. The task attracted more than 140 registered participants, and…
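The abstract names the two tracks but, as excerpted here, does not state how submissions are scored. The following is a minimal sketch of what evaluation-only scoring could look like, assuming MCQ predictions are scored by exact match on the option label and SAQ predictions by normalized match against a set of accepted answers. Both scoring rules, along with the function names `normalize`, `mcq_accuracy`, and `saq_score` and the dictionary-based data layout, are illustrative assumptions and not the task's official metrics or file format.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization, case-fold, and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())


def mcq_accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of gold questions whose predicted option label matches.

    Assumption: a question with no prediction counts as wrong.
    """
    if not gold:
        return 0.0
    correct = sum(
        1
        for qid, label in gold.items()
        if normalize(predictions.get(qid, "")) == normalize(label)
    )
    return correct / len(gold)


def saq_score(predictions: dict[str, str], accepted: dict[str, set[str]]) -> float:
    """Fraction of questions whose prediction matches any accepted answer.

    Assumption: matching is exact after normalization; the real task may
    use a different (for example, annotator-weighted) criterion.
    """
    if not accepted:
        return 0.0
    correct = sum(
        1
        for qid, answers in accepted.items()
        if normalize(predictions.get(qid, "")) in {normalize(a) for a in answers}
    )
    return correct / len(accepted)


if __name__ == "__main__":
    # Toy, made-up data purely for illustration; not drawn from the benchmark.
    print(mcq_accuracy({"q1": "A", "q2": "b"}, {"q1": "A", "q2": "B", "q3": "C"}))
    print(saq_score({"q1": " Couscous "}, {"q1": {"couscous", "tagine"}, "q2": {"rice"}}))
```

Because both functions only read predictions and gold labels, the benchmark data is never fed back into the model, which is in keeping with the evaluation-only rule described above.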
