A Discussion on Building Practical NLP Leaderboards: The Case of Machine Translation
Sebastin Santy, Prasanta Bhattacharya

TL;DR
This paper discusses the limitations of current NLP leaderboards, especially in machine translation, emphasizing the need for more practical metrics to better reflect real-world utility and guide future research.
Contribution
It highlights the risks of relying solely on accuracy metrics in NLP leaderboards and offers suggestions for developing more practical evaluation frameworks.
Findings
Over-reliance on accuracy metrics can misrepresent model utility.
Current leaderboards may not reflect real-world performance needs.
The paper proposes guidelines for more effective and practical NLP leaderboards.
Abstract
Recent advances in AI and ML applications have benefited from rapid progress in NLP research. Leaderboards have emerged as a popular mechanism to track and accelerate progress in NLP through competitive model development. While this has increased interest and participation, the over-reliance on single, accuracy-based metrics has shifted focus away from other important metrics that might be equally pertinent in real-world contexts. In this paper, we offer a preliminary discussion of the risks associated with focusing exclusively on accuracy metrics and draw on recent discussions to highlight prescriptive suggestions on how to develop more practical and effective leaderboards that can better reflect the real-world utility of models.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
