Automating Behavioral Testing in Machine Translation
Javier Ferrando, Matthias Sperber, Hendra Setiawan, Dominic Telaar, Saša Hasan

TL;DR
This paper introduces a novel LLM-based framework for automated, diverse behavioral testing of machine translation systems, uncovering nuanced differences and bugs beyond traditional accuracy metrics.
Contribution
It presents a scalable, minimally supervised approach using LLMs to generate and verify test cases for evaluating MT models' linguistic behavior.
Findings
Identified behavioral differences between MT systems that accuracy metrics do not reveal
Demonstrated the framework's ability to uncover potential bugs in MT models
Showed that pass-rates correlate with traditional metrics but reveal additional insights
Abstract
Behavioral testing in NLP allows fine-grained evaluation of systems by examining their linguistic capabilities through the analysis of input-output behavior. Unfortunately, existing work on behavioral testing in Machine Translation (MT) is currently restricted to largely handcrafted tests covering a limited range of capabilities and languages. To address this limitation, we propose to use Large Language Models (LLMs) to generate a diverse set of source sentences tailored to test the behavior of MT models in a range of situations. We can then verify whether the MT model exhibits the expected behavior through matching candidate sets that are also generated using LLMs. Our approach aims to make behavioral testing of MT systems practical while requiring only minimal human effort. In our experiments, we apply our proposed evaluation framework to assess multiple available MT systems,…
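To make the verification step concrete, here is a minimal sketch of how a pass-rate over behavioral tests could be computed. It is not the authors' implementation: the `BehavioralTest` class, the `passes` and `pass_rate` helpers, the toy translation table, and the choice of case-insensitive substring matching are all assumptions made for illustration. In the paper, both the source sentences and the candidate sets are generated by LLMs; here they are written by hand.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BehavioralTest:
    """One test case (hypothetical structure, not taken from the paper)."""
    source: str            # source sentence fed to the MT system
    candidates: List[str]  # acceptable target-side renderings of the tested phrase


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores trivial differences."""
    return " ".join(text.lower().split())


def passes(test: BehavioralTest, translation: str) -> bool:
    """A test passes if any candidate occurs in the translation.

    Substring matching is an assumption of this sketch; the abstract only
    says that outputs are matched against candidate sets.
    """
    hypothesis = normalize(translation)
    return any(normalize(c) in hypothesis for c in test.candidates)


def pass_rate(tests: List[BehavioralTest], translate: Callable[[str], str]) -> float:
    """Fraction of tests whose translation matches the candidate set."""
    if not tests:
        return 0.0
    passed = sum(passes(t, translate(t.source)) for t in tests)
    return passed / len(tests)


# Hand-written English-to-German examples standing in for LLM-generated tests.
tests = [
    BehavioralTest("The meeting is at 3 pm.", ["15 Uhr", "3 Uhr nachmittags"]),
    BehavioralTest("She paid $20.", ["20 $", "20 Dollar", "$20"]),
]

# Toy stand-in for an MT system; the second output contains a wrong number.
toy_outputs = {
    "The meeting is at 3 pm.": "Das Treffen ist um 15 Uhr.",
    "She paid $20.": "Sie zahlte 30 Dollar.",
}


def toy_translate(source: str) -> str:
    return toy_outputs[source]


print(pass_rate(tests, toy_translate))  # 0.5
```

The second test fails because the toy system changes the amount, which is the kind of error a sentence-level accuracy score can easily smooth over but a targeted behavioral test flags directly.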
Taxonomy
Topics: Software Engineering Research · Natural Language Processing Techniques · Topic Modeling
