Establishing Trustworthiness: Rethinking Tasks and Model Evaluation
Robert Litschko, Max Müller-Eberstein, Rob van der Goot, Leon Weber, Barbara Plank

TL;DR
This paper advocates for a holistic approach to evaluating NLP models, emphasizing trustworthiness and reliability over traditional task-specific metrics, especially in the context of large language models and real-world applications.
Contribution
It proposes a rethinking of NLP tasks and evaluation methods, emphasizing trustworthiness and holistic assessment for large language models.
Findings
Traditional task-based evaluation is insufficient for LLMs.
Multi-faceted evaluation protocols are recommended.
Trustworthiness should be central in NLP model assessment.
Abstract
Language understanding is a multi-faceted cognitive capability, which the Natural Language Processing (NLP) community has striven to model computationally for decades. Traditionally, facets of linguistic intelligence have been compartmentalized into tasks with specialized model architectures and corresponding evaluation protocols. With the advent of large language models (LLMs) the community has witnessed a dramatic shift towards general purpose, task-agnostic approaches powered by generative models. As a consequence, the traditional compartmentalized notion of language tasks is breaking down, followed by an increasing challenge for evaluation and analysis. At the same time, LLMs are being deployed in more real-world scenarios, including previously unforeseen zero-shot setups, increasing the need for trustworthy and reliable systems. Therefore, we argue that it is time to rethink what…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
