CUB: Benchmarking Context Utilisation Techniques for Language Models
Lovisa Hagström, Youna Kim, Haeun Yu, Sang-goo Lee, Richard Johansson, Hyunsoo Cho, Isabelle Augenstein

TL;DR
This paper introduces CUB, a comprehensive benchmark for evaluating context utilisation techniques in language models, revealing their limitations across diverse noisy contexts and datasets.
Contribution
The paper presents the first systematic benchmark for context utilisation manipulation techniques (CMTs), providing an extensive evaluation and exposing gaps in the robustness and generalisation of current methods.
Findings
Most CMTs struggle with diverse real-world contexts.
Many methods perform well on synthetic datasets but not on real ones.
Current evaluation practices are insufficient for comprehensive assessment.
Abstract
Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) have recently been proposed to alleviate these issues, few have seen systematic comparison. In this paper, we develop CUB (Context Utilisation Benchmark) - the first comprehensive benchmark designed to help diagnose CMTs under diverse noisy context conditions within retrieval-augmented generation (RAG). With this benchmark, we conduct the most extensive evaluation to date of seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to 11 LMs. Our findings expose critical gaps in current CMT…
