How Reliable are Model Diagnostics?
Vamsi Aribandi, Yi Tay, Donald Metzler

TL;DR
This paper critically examines three recent diagnostic tests for pre-trained language models, finds that likelihood-based and representation-based diagnostics are not yet as reliable as previously assumed, and offers recommendations for practitioners and researchers.
Contribution
The paper provides an empirical assessment of three existing model diagnostics, highlights their limitations, and formulates recommendations for practitioners and researchers who rely on them.
Findings
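Likelihood-based diagnostics are not yet as reliable as previously assumed.
Representation-based diagnostics are likewise not yet as reliable as previously assumed.
The authors formulate recommendations for practitioners and researchers based on their empirical findings.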
Abstract
In the pursuit of a deeper understanding of a model's behaviour, there is recent impetus for developing suites of probes aimed at diagnosing models beyond simple metrics like accuracy or BLEU. This paper takes a step back and asks an important and timely question: how reliable are these diagnostics in providing insight into models and training setups? We critically examine three recent diagnostic tests for pre-trained language models, and find that likelihood-based and representation-based model diagnostics are not yet as reliable as previously assumed. Based on our empirical findings, we also formulate recommendations for practitioners and researchers.
