AEON: A Method for Automatic Evaluation of NLP Test Cases
Jen-tse Huang, Jianping Zhang, Wenxuan Wang, Pinjia He, Yuxin Su, Michael R. Lyu

TL;DR
AEON is a novel automatic evaluation method for NLP test cases that effectively assesses semantic similarity and naturalness, reducing false alarms and improving model robustness.
Contribution
This paper introduces AEON, a new automatic evaluation approach that outperforms existing metrics in detecting semantic inconsistency and unnaturalness in NLP test cases.
Findings
AEON achieves 10% higher precision in detecting semantic inconsistencies.
AEON surpasses baselines by over 15% in identifying unnatural test cases.
Using AEON-prioritized test cases improves NLP model accuracy and robustness.
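The idea of prioritizing test cases by semantic similarity and naturalness can be sketched roughly as follows. This is a minimal illustration, not AEON's actual scoring method: the `MutatedCase` class, the `prioritize` function, the score values, the 0.5 thresholds, and the choice to rank by the sum of the two scores are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MutatedCase:
    """A generated test case with two precomputed quality scores.

    Both scores are hypothetical values in [0, 1]; how a real metric
    would compute them is outside the scope of this sketch.
    """
    original: str
    mutated: str
    semantic_similarity: float
    naturalness: float


def prioritize(cases, min_similarity=0.5, min_naturalness=0.5):
    """Drop likely false alarms, then rank the remaining cases.

    The thresholds and the sum-based ranking are illustrative assumptions.
    """
    kept = [
        c for c in cases
        if c.semantic_similarity >= min_similarity
        and c.naturalness >= min_naturalness
    ]
    # Highest combined score first.
    return sorted(
        kept,
        key=lambda c: c.semantic_similarity + c.naturalness,
        reverse=True,
    )


# Made-up example data, for illustration only.
cases = [
    MutatedCase("The movie was great.", "The film was great.", 0.95, 0.90),
    MutatedCase("The movie was great.", "The movie was grate.", 0.40, 0.30),
    MutatedCase("I love this phone.", "I really love this phone.", 0.90, 0.97),
]
ranked = prioritize(cases)
for case in ranked:
    print(case.mutated)
```

In this sketch the misspelled mutation falls below both thresholds and is filtered out, which mirrors the kind of unnatural, meaning-changing test case the findings describe as a false alarm.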
Abstract
Due to the labor-intensive nature of manual test oracle construction, various automated testing techniques have been proposed to enhance the reliability of Natural Language Processing (NLP) software. In theory, these techniques mutate an existing test case (e.g., a sentence with its label) and assume the generated one preserves an equivalent or similar semantic meaning and thus, the same label. However, in practice, many of the generated test cases fail to preserve similar semantic meaning and are unnatural (e.g., grammar errors), which leads to a high false alarm rate and unnatural test cases. Our evaluation study finds that 44% of the test cases generated by the state-of-the-art (SOTA) approaches are false alarms. These test cases require extensive manual checking effort, and instead of improving NLP software, they can even degrade NLP software when utilized in model training. To…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
