Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning

Jerry Huang; Peng Lu; Qiuhao Zeng; Yusuke Iwasawa; Yutaka Matsuo; Sarath Chandar; Edison Marrese-Taylor; Irene Li

arXiv:2601.01362·cs.CL·January 6, 2026

Open Access

TL;DR

This paper investigates how instruction-tuning affects the calibration of multilingual language models, revealing that while confidence increases in low-resource languages, accuracy gains are limited, and label smoothing can improve calibration without additional data.

Contribution

The paper provides the first comprehensive analysis of how instruction-tuning affects LLM calibration across languages, highlighting the challenges involved and solutions for improving reliability.

Findings

01. Confidence increases in low-resource languages after instruction-tuning.

02. Accuracy improvements are marginal or non-existent in low-resource languages.

03. Label smoothing helps maintain better calibration without extra data.
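To make the label smoothing finding concrete, here is a minimal sketch of a label-smoothed cross-entropy loss in plain Python. It assumes the common uniform-smoothing formulation; the function names and the default `epsilon=0.1` are illustrative choices, not details taken from the paper.

```python
import math


def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Build a smoothed target: (1 - epsilon) on the true class plus
    epsilon spread uniformly over all classes (assumed formulation)."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if k == target_index else uniform
        for k in range(num_classes)
    ]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def smoothed_cross_entropy(logits, target_index, epsilon=0.1):
    """Cross-entropy between the smoothed target and the model distribution."""
    log_probs = log_softmax(logits)
    targets = smooth_labels(target_index, len(logits), epsilon)
    return -sum(t * lp for t, lp in zip(targets, log_probs))


if __name__ == "__main__":
    confident_logits = [10.0, 0.0, 0.0]
    # With epsilon = 0 this reduces to ordinary cross-entropy.
    print(smoothed_cross_entropy(confident_logits, 0, epsilon=0.0))
    # With epsilon > 0 a very confident prediction is penalised more.
    print(smoothed_cross_entropy(confident_logits, 0, epsilon=0.1))
```

Because the smoothed target never places all of its mass on one class, the loss discourages extremely peaked output distributions, which is the usual intuition for why label smoothing can temper overconfidence.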

Abstract

Ensuring that deep learning models are well-calibrated in their predictive uncertainty is essential to maintaining their trustworthiness and reliability. Yet despite rapid advances in foundation model research, the relationship between large language models (LLMs) and their calibration remains an open area of research. In this work, we examine a critical gap in the calibration of LLMs in multilingual settings, aiming to better understand how data scarcity can lead to different calibration effects and how commonly used techniques apply in these settings. Our analysis on two multilingual benchmarks, covering 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource language SFT datasets. However, improvements in accuracy are marginal or…
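The gap between confidence and accuracy described in the abstract is often summarised with expected calibration error (ECE). The excerpt here does not say which metric the paper uses, so the following is only a sketch of one widely used measure, with equal-width binning and hypothetical function and parameter names.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin.

    confidences: predicted probabilities of the chosen answer, each in [0, 1].
    correct: booleans, True where the chosen answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Toy data only: high confidence with low accuracy gives a large ECE.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False]))
```

Under this measure, a model whose confidence rises while its accuracy stays flat, as reported for low-resource languages, would show a growing ECE.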

Taxonomy

Topics: Computational and Text Analysis Methods · Natural Language Processing Techniques · Topic Modeling