Articulate but Wrong: Self-Review Failures in LLM-Based Code Modernization

Gokul Chandra Purnachandra Reddy; Aditya Lolla; Harsha Sanku

arXiv:2605.21537 · cs.SE · May 22, 2026

TL;DR

This study evaluates whether large language models can reliably judge if their own code modernizations preserve behavior. It finds that silent failures are prevalent and that self-review has clear limitations across multiple models.

Contribution

The paper provides an empirical analysis of semantic drift and self-review failures in LLM-based code modernization, along with a new dataset, an evaluation framework, and insights into model reliability.

Findings

01

Semantic drift occurs in 39.7% of modernization attempts on semantic-trap snippets, versus 7.0% on benign-control code.

02

Models often fail to recognize their own output's semantic changes, with a 31.7% silent endorsement rate.

03

Drift rates vary widely across models and do not correlate directly with model size or cost.

Abstract

Large language model (LLM) agents are increasingly used to migrate legacy code to modern stacks. We ask a deceptively simple question: when an LLM modernizes legacy code, can the same model be relied upon to recognize when its own output silently changes observable behavior? We run 1,980 real modernization calls across 11 production LLMs from 7 distinct families on a balanced 60-snippet legacy-Python-2 corpus, evaluate every output with a type-strict behavioral oracle, and then ask each model to judge whether its own output preserves behavior. We report four findings. (1) Semantic-preservation drift is prevalent and sharply separable from a cleanly-controlled baseline: semantic-trap snippets drift in 39.7% of attempts versus 7.0% on benign-control code that requires no real modernization (+32.7 percentage points; n=660 each). (2) Drift concentrates on specific snippets that fail across…
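To make the idea of a type-strict behavioral oracle concrete, here is a minimal sketch in Python. It is not the paper's implementation: the names `type_strict_equal`, `drifts`, `legacy_average`, and `modernized_average` are hypothetical, and the integer-division example is an assumed illustration of a semantic trap, chosen because Python 2 `/` on integers floors while Python 3 `/` returns a float.

```python
from typing import Any, Callable, Iterable, Tuple


def type_strict_equal(expected: Any, actual: Any) -> bool:
    """Return True only if both values have the exact same type and equal content.

    Lists, tuples, and dict values are compared recursively. Dict key types
    are not compared in this sketch.
    """
    if type(expected) is not type(actual):
        return False
    if isinstance(expected, (list, tuple)):
        return len(expected) == len(actual) and all(
            type_strict_equal(e, a) for e, a in zip(expected, actual)
        )
    if isinstance(expected, dict):
        return expected.keys() == actual.keys() and all(
            type_strict_equal(expected[k], actual[k]) for k in expected
        )
    return expected == actual


def legacy_average(total: int, count: int) -> int:
    # Assumed stand-in for Python 2 `total / count` on ints, which floors.
    return total // count


def modernized_average(total: int, count: int) -> float:
    # A plausible-looking port that drifts: `/` is true division in Python 3.
    return total / count


def drifts(
    reference: Callable[..., Any],
    candidate: Callable[..., Any],
    inputs: Iterable[Tuple[Any, ...]],
) -> bool:
    """Report drift if any input yields a type-strict mismatch."""
    return any(
        not type_strict_equal(reference(*args), candidate(*args))
        for args in inputs
    )


# Loose equality hides the change (2 == 2.0), the strict check does not.
print(drifts(legacy_average, modernized_average, [(4, 2)]))  # True
```

The design choice worth noting is the exact type comparison: with plain `==`, the input `(4, 2)` would pass because `2 == 2.0`, so a value-only oracle could miss an `int` to `float` change that downstream code may depend on.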
