Cross-Lingual Stability and Bias in Instruction-Tuned Language Models for Humanitarian NLP

Poli Nemkova; Amrit Adhikari; Matthew Pearson; Vamsi Krishna Sadu; Mark V. Albert

arXiv:2510.22823 · cs.CL · October 28, 2025

TL;DR

This study systematically compares commercial and open-weight large language models for multilingual human-rights-violation detection. It finds that instruction alignment improves stability and reliability across languages, which matters most for resource-constrained humanitarian organizations.

Contribution

The paper presents what the authors describe as the first systematic comparison of commercial and open-weight LLMs for multilingual human rights monitoring, and it highlights the importance of instruction alignment for stability in low-resource languages.

Findings

1. Aligned models maintain stable accuracy across languages.
2. Open-weight models show prompt-language sensitivity and calibration drift.
3. Alignment, not scale, is key to multilingual stability.

Abstract

Humanitarian organizations face a critical choice: invest in costly commercial APIs or rely on free open-weight models for multilingual human rights monitoring. While commercial systems offer reliability, open-weight alternatives lack empirical validation -- especially for low-resource languages common in conflict zones. This paper presents the first systematic comparison of commercial and open-weight large language models (LLMs) for human-rights-violation detection across seven languages, quantifying the cost-reliability trade-off facing resource-constrained organizations. Across 78,000 multilingual inferences, we evaluate six models -- four instruction-aligned (Claude-Sonnet-4, DeepSeek-V3, Gemini-Flash-2.0, GPT-4.1-mini) and two open-weight (LLaMA-3-8B, Mistral-7B) -- using both standard classification metrics and new measures of cross-lingual reliability: Calibration Deviation (CD),…
