SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
Raoyuan Zhao, Abdullatif Köksal, Yihong Liu, Leonie Weissweiler, Anna Korhonen, Hinrich Schütze

TL;DR
SYNTHEVAL is a hybrid framework that uses large language models to generate diverse test cases for NLP models, combining automated generation with expert analysis to identify model weaknesses.
Contribution
It introduces a novel hybrid approach that leverages LLMs for test generation and human expertise for failure analysis, reducing manual effort in behavioral testing.
Findings
Effectively identifies weaknesses in sentiment analysis and toxic language detection models.
Demonstrates the utility of LLM-generated test cases for comprehensive NLP evaluation.
Provides a publicly available codebase for reproducibility.
Abstract
Traditional benchmarking in NLP typically involves using static held-out test sets. However, this approach often results in an overestimation of performance and lacks the ability to offer comprehensive, interpretable, and dynamic assessments of NLP models. Recently, works like DynaBench (Kiela et al., 2021) and CheckList (Ribeiro et al., 2020) have addressed these limitations through behavioral testing of NLP models with test types generated by a multistep human-annotated pipeline. Unfortunately, manually creating a variety of test types requires much human labor, often at prohibitive cost. In this work, we propose SYNTHEVAL, a hybrid behavioral testing framework that leverages large language models (LLMs) to generate a wide range of test types for a comprehensive evaluation of NLP models. SYNTHEVAL first generates sentences via LLMs using controlled generation, and then identifies…
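The general idea of behavioral testing with generated sentences can be illustrated with a minimal sketch. This is not the authors' implementation: the sentence list stands in for LLM output, and `toy_sentiment_model`, `collect_failures`, and the labels are hypothetical names chosen for illustration. The sketch only shows how a model under test is run over candidate sentences and how its failures are collected for later human analysis.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for LLM-generated sentences paired with reference labels.
# In a real setup these would come from a controlled-generation step.
SYNTHETIC_CASES: List[Tuple[str, str]] = [
    ("The movie was wonderful.", "positive"),
    ("The movie was not wonderful.", "negative"),
    ("I hated every minute of it.", "negative"),
    ("What a delightful surprise!", "positive"),
]


def toy_sentiment_model(text: str) -> str:
    """Naive keyword classifier, used here only as the model under test."""
    positive_words = {"wonderful", "delightful", "great"}
    tokens = {tok.strip(".,!?").lower() for tok in text.split()}
    return "positive" if tokens & positive_words else "negative"


def collect_failures(
    model: Callable[[str], str],
    cases: List[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (text, expected, predicted) for every case the model gets wrong."""
    failures = []
    for text, expected in cases:
        predicted = model(text)
        if predicted != expected:
            failures.append((text, expected, predicted))
    return failures


failures = collect_failures(toy_sentiment_model, SYNTHETIC_CASES)
failure_rate = len(failures) / len(SYNTHETIC_CASES)

for text, expected, predicted in failures:
    print(f"FAIL: {text!r} expected={expected} predicted={predicted}")
print(f"failure rate: {failure_rate:.2f}")
```

In this toy run the keyword classifier misses the negated sentence, which is the kind of failing example a human expert would then inspect to characterize a weakness.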
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
