Cross-Lingual Stability and Bias in Instruction-Tuned Language Models for Humanitarian NLP
Poli Nemkova, Amrit Adhikari, Matthew Pearson, Vamsi Krishna Sadu, Mark V. Albert

TL;DR
This study systematically compares commercial and open-weight large language models for multilingual human rights violation detection. It finds that instruction alignment improves stability and reliability across languages, a property that matters for resource-limited humanitarian organizations.
Contribution
The paper provides the first empirical comparison of commercial and open-weight LLMs for multilingual human rights monitoring, and highlights the importance of instruction alignment for stability in low-resource languages.
Findings
Aligned models maintain stable accuracy across languages.
Open-weight models show prompt-language sensitivity and calibration drift.
Alignment, not scale, is key to multilingual stability.
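One simple way to make "stable accuracy across languages" concrete is to compute accuracy separately for each language and then look at the gap between the best and worst language. The sketch below does only that. It is an illustration under stated assumptions, not the paper's method: the function names, the record layout, the placeholder language codes (lang_a, lang_b), and the toy data are all hypothetical, and the max-minus-min spread is a generic stand-in rather than the paper's Calibration Deviation, whose definition is not given on this page.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return accuracy per language.

    records: iterable of (language, gold_label, predicted_label) tuples.
    This layout is an assumption made for the example.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, gold, pred in records:
        total[lang] += 1
        correct[lang] += int(gold == pred)
    return {lang: correct[lang] / total[lang] for lang in total}


def accuracy_spread(acc_by_lang):
    """Gap between the best and worst language (0.0 means perfectly stable)."""
    values = list(acc_by_lang.values())
    return max(values) - min(values)


# Toy, made-up predictions for two placeholder languages.
records = [
    ("lang_a", 1, 1), ("lang_a", 0, 0), ("lang_a", 1, 1), ("lang_a", 0, 1),
    ("lang_b", 1, 1), ("lang_b", 0, 0), ("lang_b", 1, 0), ("lang_b", 0, 1),
]

acc = per_language_accuracy(records)
spread = accuracy_spread(acc)
print(acc, spread)
```

A model with a small spread behaves similarly across languages under this crude measure; a large spread would be one symptom of the language sensitivity described in the findings.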
Abstract
Humanitarian organizations face a critical choice: invest in costly commercial APIs or rely on free open-weight models for multilingual human rights monitoring. While commercial systems offer reliability, open-weight alternatives lack empirical validation, especially for low-resource languages common in conflict zones. This paper presents the first systematic comparison of commercial and open-weight large language models (LLMs) for human-rights-violation detection across seven languages, quantifying the cost-reliability trade-off facing resource-constrained organizations. Across 78,000 multilingual inferences, we evaluate six models, four instruction-aligned (Claude-Sonnet-4, DeepSeek-V3, Gemini-Flash-2.0, GPT-4.1-mini) and two open-weight (LLaMA-3-8B, Mistral-7B), using both standard classification metrics and new measures of cross-lingual reliability: Calibration Deviation (CD),…
