Models of reference production: How do they withstand the test of time?
Fahime Same, Guanyi Chen, Kees van Deemter

TL;DR
This paper critically examines the robustness of models for reference production in NLP, revealing that older benchmarks like GREC are unreliable and that pre-trained models offer more consistent performance across datasets.
Contribution
The study highlights the limitations of traditional benchmarks like GREC and demonstrates that pre-trained language models are more robust in reference production tasks.
Findings
The GREC benchmark is unreliable for assessing model performance.
Pre-trained models are less affected by dataset choice.
Model performance varies considerably with the choice of evaluation metric.
Abstract
In recent years, many NLP studies have focused solely on performance improvement. In this work, we focus on the linguistic and scientific aspects of NLP. We use the task of generating referring expressions in context (REG-in-context) as a case study and start our analysis from GREC, a comprehensive set of shared tasks in English that addressed this topic over a decade ago. We ask what the performance of models would be if we assessed them (1) on more realistic datasets, and (2) using more advanced methods. We test the models using different evaluation metrics and feature selection experiments. We conclude that GREC can no longer be regarded as offering a reliable assessment of models' ability to mimic human reference production, because the results are highly impacted by the choice of corpus and evaluation metrics. Our results also suggest that pre-trained language models are less…
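The abstract's point that conclusions depend on the evaluation metric can be illustrated with a toy example. This is a minimal sketch, not the authors' evaluation code. The two metrics (form-type accuracy and exact string accuracy), the four coarse form categories (name, common noun phrase, pronoun, empty), and all example data are hypothetical and assumed only for illustration. They show how two models can swap ranks depending on which metric is reported.

```python
# Hypothetical toy data: each reference is a (form_type, surface_string) pair.
# The form categories ("name", "common", "pronoun", "empty") are an assumption
# made for illustration, not taken from the paper.

def type_accuracy(gold, pred):
    """Fraction of references whose predicted form type matches the gold type."""
    hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    return hits / len(gold)

def string_accuracy(gold, pred):
    """Fraction of references whose predicted string exactly matches the gold string."""
    hits = sum(g[1] == p[1] for g, p in zip(gold, pred))
    return hits / len(gold)

# One gold sequence of references to the same entity in a short text.
gold = [("name", "Marie Curie"), ("pronoun", "she"),
        ("common", "the physicist"), ("pronoun", "her")]

# Model A picks every form type correctly but often chooses different wording.
model_a = [("name", "Curie"), ("pronoun", "she"),
           ("common", "the scientist"), ("pronoun", "her")]

# Model B overuses names but copies the gold wording more often.
model_b = [("name", "Marie Curie"), ("name", "Marie Curie"),
           ("common", "the physicist"), ("pronoun", "her")]

for label, pred in [("A", model_a), ("B", model_b)]:
    print(label, type_accuracy(gold, pred), string_accuracy(gold, pred))
```

Running it prints `A 1.0 0.5` and `B 0.75 0.75`. Model A ranks first by type accuracy, while Model B ranks first by string accuracy, so a single reported metric can reverse the apparent ranking.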
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Feature Selection · Focus
