Lost in Translation? A Comparative Study on the Cross-Lingual Transfer of Composite Harms
Vaibhav Shukla, Hardik Sharma, Adith N Reganti, Soham Wasmatkar, Bagesh Kumar, Vrijendra Singh

TL;DR
This study evaluates how safety harms in large language models transfer across languages using a new multilingual benchmark, revealing significant challenges in maintaining safety standards in non-English languages.
Contribution
Introduces CompositeHarm, a multilingual benchmark combining adversarial and real-world harms, and analyzes safety transfer across six languages with scalable, energy-efficient evaluation methods.
Findings
Attack success rates increase in Indic languages, especially with adversarial syntax.
Contextual harms transfer more moderately across languages.
Lightweight inference strategies enable scalable, environmentally friendly multilingual safety testing.
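As a rough illustration of the metric behind these findings, the sketch below computes a per-language attack success rate, taken here to mean the fraction of prompts for which a model's response was judged harmful. This definition, the function name `attack_success_rate`, the record layout, and the sample judgments are all assumptions made for illustration; they are not taken from the paper and do not reflect its results.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return {language: fraction of prompts whose attack succeeded}.

    `records` is an iterable of (language, attack_succeeded) pairs.
    This layout is hypothetical, not the paper's data format.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for language, attack_succeeded in records:
        totals[language] += 1
        if attack_succeeded:
            successes[language] += 1
    return {lang: successes[lang] / totals[lang] for lang in totals}


# Made-up placeholder judgments, NOT results from the paper.
records = [
    ("English", False), ("English", False), ("English", True), ("English", False),
    ("Hindi", True), ("Hindi", False), ("Hindi", True), ("Hindi", False),
]

rates = attack_success_rate(records)
for language, rate in rates.items():
    print(f"{language}: {rate:.2f}")
```

Comparing the rate for each target language against the English rate on the same translated prompts gives one simple view of how well safety behavior transfers.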
Abstract
Most safety evaluations of large language models (LLMs) remain anchored in English. Translation is often used as a shortcut to probe multilingual behavior, but it rarely captures the full picture, especially when harmful intent or structure morphs across languages. Some types of harm survive translation almost intact, while others distort or disappear. To study this effect, we introduce CompositeHarm, a translation-based benchmark designed to examine how safety alignment holds up as both syntax and semantics shift. It combines two complementary English datasets, AttaQ, which targets structured adversarial attacks, and MMSafetyBench, which covers contextual, real-world harms, and extends them into six languages: English, Hindi, Assamese, Marathi, Kannada, and Gujarati. Using three large models, we find that attack success rates rise sharply in Indic languages, especially under…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Explainable Artificial Intelligence (XAI)
