EVALOOOP: A Self-Consistency-Centered Framework for Assessing Large Language Model Robustness in Programming

Sen Fang; Weiyuan Ding; Mengshi Zhang; Zihao Chen; Bowen Xu

arXiv:2505.12185 · cs.SE · February 17, 2026

Open Access

TL;DR

EVALOOOP is a self-consistency-based framework for assessing LLM robustness in programming. It has a model iteratively transform between code and natural language and reports a unified metric that captures intrinsic stability without relying on external attacks.

Contribution

The paper proposes EVALOOOP, a novel robustness assessment framework that evaluates LLMs through iterative self-referential transformations, addressing limitations of traditional adversarial attacks.
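
To make the idea of an iterative, self-referential loop concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the loop alternates the two dual tasks named in the abstract (code generation and code summarization) and stops at the first failing program. The function name `count_sustained_loops`, the three callables, the `max_loops` default, and the toy stand-ins are all hypothetical, and the returned count is only an illustrative score, not necessarily the paper's metric.

```python
from typing import Callable


def count_sustained_loops(
    prompt: str,
    generate_code: Callable[[str], str],   # hypothetical: prompt -> code
    summarize_code: Callable[[str], str],  # hypothetical: code -> new prompt
    passes_tests: Callable[[str], bool],   # hypothetical: functional check
    max_loops: int = 10,                   # assumed cap, chosen for illustration
) -> int:
    """Count consecutive loops whose generated code still passes the tests."""
    sustained = 0
    for _ in range(max_loops):
        code = generate_code(prompt)
        if not passes_tests(code):
            break  # the model failed to stay consistent with itself
        sustained += 1
        # The next input is produced by the model itself, not by an attacker.
        prompt = summarize_code(code)
    return sustained


# Toy stand-ins for a model: each summarization drops one word, and the
# "tests" pass only while at least three words of the task survive.
def toy_generate(prompt: str) -> str:
    return prompt.upper()


def toy_summarize(code: str) -> str:
    return " ".join(code.lower().split()[:-1])


def toy_passes(code: str) -> bool:
    return len(code.split()) >= 3


loops = count_sustained_loops(
    "add two numbers together", toy_generate, toy_summarize, toy_passes
)
print(loops)
```

In a real setting the three callables would wrap calls to the model under evaluation and a benchmark's test harness. A model that loses information on each pass through the loop sustains fewer loops than one whose summaries preserve the task.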

Findings

01 EVALOOOP reveals an accuracy drop of 2.65% to 47.62% across models.

02 Robustness does not always correlate with initial performance.

03 Some models outperform others in robustness despite lower initial accuracy.

Abstract

Evaluating the programming robustness of large language models (LLMs) is paramount for ensuring their reliability in AI-based software development. However, adversarial attacks exhibit fundamental limitations that compromise fair robustness assessment: they demonstrate contradictory evaluation outcomes where different attack strategies tend to favor different models, and more critically, they operate solely through external perturbations, failing to capture the intrinsic stability essential for autonomous coding agents where subsequent inputs are endogenously generated by the model itself. We introduce EVALOOOP, a novel assessment framework that evaluates robustness from a self-consistency perspective, leveraging the natural duality inherent in software engineering tasks (e.g., code generation and code summarization). EVALOOOP establishes a self-contained feedback loop where an LLM…

Taxonomy

Topics: Software Engineering Research · Software System Performance and Reliability · Software Reliability and Analysis Research

Methods: Attention Is All You Need · Label Smoothing · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Cosine Annealing · Attention Dropout · Residual Connection