Assessing the Robustness of LLM-based NLP Software via Automated Testing

Mingxuan Xiao; Yan Xiao; Shunhui Ji; Hanbo Cai; Lei Xue; Pengcheng Zhang

arXiv:2412.21016·cs.SE·March 18, 2025

Open Access

TL;DR

This paper introduces AORTA, an automated framework for testing the robustness of LLM-based NLP software, featuring a novel Adaptive Beam Search method that improves testing efficiency and effectiveness.

Contribution

The paper presents the first automated robustness testing framework for LLM-based NLP software and a new Adaptive Beam Search method tailored for large language models.
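As a rough illustration of the search idea behind this kind of testing, the sketch below runs a plain fixed-width beam search over single-word substitutions until a toy "model" changes its decision. It is not the paper's Adaptive Beam Search, whose details are not given on this page. The function `beam_search_perturb`, the scoring function, the failure check, and the synonym table are all hypothetical stand-ins chosen for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple


def beam_search_perturb(
    words: List[str],
    candidates: Dict[str, List[str]],
    score: Callable[[List[str]], float],
    is_failure: Callable[[List[str]], bool],
    beam_width: int = 2,
    max_steps: int = 3,
) -> Optional[List[str]]:
    """Plain fixed-width beam search over single-word substitutions.

    Returns the first perturbed input for which is_failure is true,
    or None if no such input is found within max_steps.
    """
    beam: List[List[str]] = [list(words)]
    for _ in range(max_steps):
        expanded: List[Tuple[float, List[str]]] = []
        for text in beam:
            for i, word in enumerate(text):
                for sub in candidates.get(word, []):
                    perturbed = text[:i] + [sub] + text[i + 1:]
                    if is_failure(perturbed):
                        return perturbed
                    expanded.append((score(perturbed), perturbed))
        if not expanded:
            return None
        # Keep only the highest-scoring candidates for the next step.
        expanded.sort(key=lambda pair: pair[0], reverse=True)
        beam = [text for _, text in expanded[:beam_width]]
    return None


# Toy stand-in for the software under test (assumption): the "model"
# flips its decision once two negative words appear in the input.
NEGATIVE = {"poor", "weak"}


def toy_score(text: List[str]) -> float:
    return float(sum(w in NEGATIVE for w in text))


def toy_is_failure(text: List[str]) -> bool:
    return toy_score(text) >= 2


original = ["strong", "profit", "and", "solid", "growth"]
synonyms = {"strong": ["robust", "poor"], "solid": ["firm", "weak"]}

result = beam_search_perturb(original, synonyms, toy_score, toy_is_failure)
print(result)
```

In a real setting the score would presumably come from the LLM's output (for example, a drop in confidence on the original label) and the candidates from a perturbation method. Both are assumptions here, not details taken from the paper.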

Findings

01. ABS (Adaptive Beam Search) achieves an average test success rate of 86.138%.

02. ABS significantly reduces computational overhead compared to baseline methods.

03. Test cases generated by ABS are more natural and transferable.

Abstract

Benefiting from the advancements in LLMs, NLP software has undergone rapid development. Such software is widely employed in various safety-critical tasks, such as financial sentiment analysis, toxic content moderation, and log generation. Unlike traditional software, LLM-based NLP software relies on prompts and examples as inputs. Given the complexity of LLMs and the unpredictability of real-world inputs, quantitatively assessing the robustness of such software is crucial. However, to the best of our knowledge, no automated robustness testing methods have been specifically designed to evaluate the overall inputs of LLM-based NLP software. To this end, this paper introduces the first AutOmated Robustness Testing frAmework, AORTA, which reconceptualizes the testing process into a combinatorial optimization problem. Existing testing methods designed for DNN-based software can be applied to…

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Reliability and Analysis Research