CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation

Chaeyun Kim; YongTaek Lim; Kihyun Kim; Junghwan Kim; Minwoo Kim

arXiv:2602.20170·cs.CY·February 25, 2026

Open Access · 3 Reviews

TL;DR

CAGE is a framework that systematically adapts red-teaming prompts to different cultural contexts, improving the detection of socio-technical vulnerabilities in language models beyond simple translation methods.

Contribution

It introduces the Semantic Mold approach for disentangling adversarial structure from cultural content, enabling realistic, localized threat modeling in safety benchmarks.
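The idea of separating a prompt's fixed structure from its swappable cultural content can be pictured as a template with named slots. The following is a minimal sketch under that assumption, not the authors' implementation: the `SemanticMold` class, its methods, the slot names, and the deliberately benign example content are all hypothetical.

```python
from dataclasses import dataclass
from string import Formatter


@dataclass(frozen=True)
class SemanticMold:
    """Hypothetical container: a fixed prompt structure with named slots."""

    template: str

    def slots(self) -> list[str]:
        # Collect the named placeholders that appear in the template, in order.
        return [name for _, name, _, _ in Formatter().parse(self.template) if name]

    def fill(self, content: dict[str, str]) -> str:
        # Refuse to render if any slot lacks culture-specific content.
        missing = [s for s in self.slots() if s not in content]
        if missing:
            raise KeyError(f"unfilled slots: {missing}")
        return self.template.format(**content)


# The structure stays fixed; only the slot content is localized.
mold = SemanticMold("As a {role}, explain how {institution} handles {topic}.")
korean_content = {
    "role": "tenant",
    "institution": "the district office",
    "topic": "jeonse deposits",
}
prompt = mold.fill(korean_content)
```

Keeping the template immutable and validating slots before rendering means the same structure can be reused with content for another locale without touching the mold itself.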

Findings

01

KoRSET outperforms translation baselines in vulnerability detection.

02

CAGE provides scalable, culturally aware safety benchmarks.

03

The dataset and evaluation tools are publicly available.

Abstract

Existing red-teaming benchmarks, when adapted to new languages via direct translation, fail to capture socio-technical vulnerabilities rooted in local culture and law, creating a critical blind spot in LLM safety evaluation. To address this gap, we introduce CAGE (Culturally Adaptive Generation), a framework that systematically adapts the adversarial intent of proven red-teaming prompts to new cultural contexts. At the core of CAGE is the Semantic Mold, a novel approach that disentangles a prompt's adversarial structure from its cultural content. This approach enables the modeling of realistic, localized threats rather than testing for simple jailbreaks. As a representative example, we demonstrate our framework by creating KoRSET, a Korean benchmark, which proves more effective at revealing vulnerabilities than direct translation baselines. CAGE offers a scalable solution for developing…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The experimental results show that prompts generated by the proposed framework, CAGE, achieve a higher attack success rate (ASR) than the translation-based approach.
2. This work introduces a Korean red-teaming dataset, KoRSET.
3. The proposed framework is better suited to low-resource languages, where translation quality is lower than for high-resource languages.

Weaknesses

The proposed framework seems to require substantial manual work, such as constructing the taxonomy by hand and gathering country-specific content such as keywords and topics.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper studies an interesting, timely, and important problem that should be exciting to the community.
2. Generating culturally aware red-teaming benchmarks can benefit the community. The benchmark dataset that the authors created in this paper (KoRSET), along with the taxonomy, can be really useful.
3. The authors perform various ablations and use various models to study the effectiveness of their work.

Weaknesses

1. The paper is limited in scope, as it focuses on Korean only. It would be more interesting if the authors could expand their study to more diverse cultural contexts.
2. Some terms used in the paper, such as mold and slot, are not well defined at the beginning, so it may take some time for the reader to understand what they mean after reading the paper and going over some examples. It would be nice if the authors define these terms upfront i

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- Novel semantic mold framework. The core idea of separating adversarial structure from cultural content through slot-based semantic molds is creative and provides a systematic approach to cultural adaptation.
- Strong and consistent quantitative results. CAGE demonstrates substantial improvements over direct translation across all tested models and attack methods. The results are comprehensive, covering multiple models and four automated attack frameworks.
- Demonstrated generalizability. The

Weaknesses

- Missing critical baselines and comparisons with existing cross-cultural adaptation methods. The paper discusses three categories of existing approaches in Section 2.3—direct translation, template adaptation (e.g., KoBBQ), and native construction (e.g., KorNAT)—and claims CAGE integrates their benefits while avoiding their limitations. However, experiments only compare against direct translation, the weakest baseline. Notably absent is comparison with template-based methods like KoBBQ, which al

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Hate Speech and Cyberbullying Detection