Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements
Yifei Wang, Tianlin Li, Xiaohan Zhang, Xiaoyu Zhang, Wei Ma, Mingfei Cheng, Li Pan

TL;DR
This paper introduces PrecisionDiff, a framework for systematically detecting subtle behavioral disagreements that arise when the same large language model is run at different numerical precisions. Standard evaluation methods often overlook these disagreements.
Contribution
The paper presents an automated differential testing framework that uncovers precision-induced behavioral disagreements in LLMs, with the goal of improving their evaluation and robustness.
Findings
Behavioral disagreements are widespread across open-source LLMs and precision settings.
PrecisionDiff significantly outperforms traditional testing methods in detecting these issues.
Precision-induced divergences can lead to harmful responses in certain cases.
Abstract
Large language models (LLMs) are increasingly deployed under diverse numerical precision configurations, including standard floating-point formats (e.g., bfloat16 and float16) and quantized integer formats (e.g., int16 and int8), to meet efficiency and resource constraints. However, minor inconsistencies between LLMs of different precisions are difficult to detect and are often overlooked by existing evaluation methods. In this paper, we present PrecisionDiff, an automated differential testing framework for systematically detecting precision-induced behavioral disagreements in LLMs. PrecisionDiff generates precision-sensitive test inputs and performs cross-precision comparative analysis to uncover subtle divergences that remain hidden under conventional testing strategies. To demonstrate its practical significance, we instantiate PrecisionDiff on the alignment verification task, where…
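To make the idea of cross-precision comparison concrete, here is a minimal sketch. It is not the PrecisionDiff implementation, and it does not reproduce the paper's precision-sensitive input generation. It assumes a toy linear scorer in place of an LLM, and the names `logits`, `disagrees`, and `find_disagreements` are hypothetical. The same weights are evaluated in float32 and float16, and an input is flagged when the top-scoring output differs between the two.

```python
import numpy as np


def logits(weights, x, dtype):
    """Score input x with the weights and input both cast to `dtype`."""
    w = weights.astype(dtype)
    v = x.astype(dtype)
    return w @ v


def disagrees(weights, x, ref_dtype=np.float32, test_dtype=np.float16):
    """True if the argmax output differs between the two precisions."""
    ref_choice = int(np.argmax(logits(weights, x, ref_dtype)))
    test_choice = int(np.argmax(logits(weights, x, test_dtype)))
    return ref_choice != test_choice


def find_disagreements(weights, inputs):
    """Return indices of inputs whose top output changes across precisions."""
    return [i for i, x in enumerate(inputs) if disagrees(weights, x)]


# Toy weights (an assumption for illustration): 1.0004 is distinct from 1.0
# in float32 but rounds to 1.0 in float16, so the two rows tie at low precision.
WEIGHTS = np.array([[1.0, 0.5],
                    [1.0004, 0.0]])

INPUTS = [
    np.array([1.0, 0.0]),  # near-tied scores: sensitive to precision
    np.array([0.0, 1.0]),  # clear margin: same choice at both precisions
    np.array([1.0, 1.0]),  # clear margin: same choice at both precisions
]

if __name__ == "__main__":
    print(find_disagreements(WEIGHTS, INPUTS))
```

In this toy setting only the near-tied input is flagged, which is why a test suite made of clear-margin inputs can miss such divergences entirely.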
