A semantic mutation metric for metamorphic relation adequacy in scientific computing programs

Meng Li (1,2,3), Xiaohua Yang (1,2,3), Jie Liu (1,2,3), Shiyu Yan (1,2,3) ((1) School of Computing, University of South China, Hengyang, 421001, China (2) Hunan Engineering Research Center of Software Evaluation, Testing for Intellectual Equipment, Hengyang, 421001, China (3) CNNC Key Laboratory on High Trusted Computing, Hengyang, 421001, China)

arXiv:2605.17437·cs.SE·May 19, 2026

TL;DR

This paper introduces the Semantic Mutation Score (SMS), a new domain-semantic mutation metric for evaluating metamorphic relations in scientific computing programs, addressing limitations of classical syntactic mutation scores.

Contribution

The paper proposes SMS, a semantic mutation metric built on five domain-specific operators, and demonstrates its effectiveness in assessing metamorphic relation adequacy in scientific computing.

Findings

01

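SMS degenerates almost everywhere to the classical Mutation Score (MS) in a characterised limit, so SMS-based conclusions stay consistent with prior mutation-testing literature in the classical regime.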

02

Cross-source pooling does not significantly affect the effect size.

03

Certain semantic mutation classes are unreachable with default syntactic configurations.

Abstract

Context. Metamorphic Testing addresses the test-oracle problem in scientific computing, but classical Mutation Score operates on syntactic AST mutations and misses domain semantics.

Objective. We propose the Semantic Mutation Score (SMS), built on five domain-semantic operators (Conservation Erosion, Operator Substitution, Hyperparameter, Trajectory Flip, Structural Injection). SMS degenerates almost everywhere to MS in a characterised limit, so any SMS-based conclusion remains consistent with prior mutation-testing literature in the classical regime.

Method. A 12-PUT × 5-MP design over four single-output float-to-float classes (numeric, probabilistic, surrogate, machine-learning) is paired with a three-layer attribution classifier separating true semantic faults from tolerance, OOD, statistical, and artefact categories. A same-source / cross-source ablation under an identical…
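The abstract does not give the SMS formula, so the sketch below illustrates only the classical baseline it builds on: a mutation score computed against a set of metamorphic relations, where a mutant counts as killed if any relation is violated on any test input. Everything named here is an assumption made for illustration (the `taylor_sin` program under test, the three mutants, the two relations, the tolerance `TOL`, and the input list); the mutants are not the paper's five operators.

```python
import math
from typing import Callable, Dict, Sequence

Program = Callable[[float], float]
Relation = Callable[[Program, float], bool]

TOL = 1e-6  # assumed absolute tolerance for float comparisons
INPUTS = [0.1, 0.5, 1.0, 2.0, 3.0]  # assumed source test inputs


def taylor_sin(x: float, terms: int = 10, sign: int = -1, scale: float = 1.0) -> float:
    """Hypothetical program under test: truncated Taylor series for sin(x)."""
    series = sum(
        sign ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
        for k in range(terms)
    )
    return scale * series


# Illustrative mutants, each changing one aspect of the program under test.
MUTANTS: Dict[str, Program] = {
    "sign_flip": lambda x: taylor_sin(x, sign=1),          # series becomes sinh-like
    "truncated_series": lambda x: taylor_sin(x, terms=2),  # only x - x^3/6
    "output_scaled": lambda x: taylor_sin(x, scale=2.0),   # doubles every output
}


def mr_odd(f: Program, x: float) -> bool:
    """Relation: f(-x) == -f(x)."""
    return abs(f(-x) + f(x)) <= TOL


def mr_reflect(f: Program, x: float) -> bool:
    """Relation: f(pi - x) == f(x)."""
    return abs(f(math.pi - x) - f(x)) <= TOL


def is_killed(mutant: Program, relations: Sequence[Relation], inputs: Sequence[float]) -> bool:
    """A mutant is killed if at least one relation fails on at least one input."""
    return any(not mr(mutant, x) for mr in relations for x in inputs)


def mutation_score(mutants: Dict[str, Program], relations: Sequence[Relation],
                   inputs: Sequence[float]) -> float:
    """Classical mutation score: killed mutants divided by all mutants."""
    killed = sum(is_killed(m, relations, inputs) for m in mutants.values())
    return killed / len(mutants)


if __name__ == "__main__":
    print(mutation_score(MUTANTS, [mr_odd], INPUTS))
    print(mutation_score(MUTANTS, [mr_odd, mr_reflect], INPUTS))
```

In this toy setup the odd-symmetry relation alone kills none of the three mutants, because each mutant is still an odd function. Adding the reflection relation kills the sign-flip and truncation mutants, while the output-scaling mutant survives both relations. That gap between what a relation set can and cannot detect is the kind of adequacy question a mutation score is meant to expose.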
