Know Or Not: a library for evaluating out-of-knowledge base robustness
Jessica Foo, Pradyumna Shyama Prasad, Shaun Khoo

TL;DR
This paper introduces 'knowornot', an open-source library for evaluating the robustness of large language models in high-stakes scenarios, specifically their ability to abstain from answering questions outside their knowledge base in retrieval-augmented generation settings.
Contribution
It presents a novel systematic methodology and a flexible, extensible library for assessing out-of-knowledge base robustness without manual annotations.
Findings
Developed a comprehensive benchmark, PolicyBench, for government policy QA chatbots.
Demonstrated the utility of knowornot in evaluating LLMs' out-of-knowledge base (OOKB) robustness.
Provided a modular, reproducible framework for customized robustness evaluation.
Abstract
While the capabilities of large language models (LLMs) have progressed significantly, their use in high-stakes applications has been limited due to the risk of hallucination. One key approach to reducing hallucination is retrieval-augmented generation (RAG), but even in such setups, LLMs may still hallucinate when presented with questions outside of the knowledge base. Such behavior is unacceptable in high-stakes applications, where LLMs are expected to abstain from answering queries for which they lack sufficient context. In this work, we present a novel methodology for systematically evaluating the out-of-knowledge base (OOKB) robustness of LLMs (whether LLMs know or do not know) in the RAG setting, without the need for manual annotation of gold standard answers. We implement our methodology in knowornot, an open-source library that enables users to develop their own customized evaluation…
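To make the idea of measuring abstention concrete, the following is a minimal sketch of how an abstention rate over out-of-knowledge base questions could be computed. It is not the knowornot API: the names `ABSTAIN_MARKERS`, `is_abstention`, and `abstention_rate` are hypothetical, and the keyword-based abstention check is an assumption made for illustration. The paper's own evaluation may judge abstention differently.

```python
from typing import Callable, Sequence

# Hypothetical phrases treated as signs of abstention (an assumption,
# not taken from the paper or the knowornot library).
ABSTAIN_MARKERS = (
    "i don't know",
    "i do not know",
    "no information",
    "cannot answer",
)


def is_abstention(answer: str) -> bool:
    """Return True if the answer contains any abstention marker."""
    text = answer.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def abstention_rate(
    questions: Sequence[str],
    answer_fn: Callable[[str], str],
) -> float:
    """Fraction of questions for which answer_fn abstains.

    All questions are assumed to be outside the knowledge base, so a
    higher rate indicates better OOKB robustness under this sketch.
    """
    if not questions:
        raise ValueError("questions must not be empty")
    abstained = sum(is_abstention(answer_fn(q)) for q in questions)
    return abstained / len(questions)
```

Here `answer_fn` stands in for any RAG pipeline that maps a question to an answer string. In practice, simple keyword matching is brittle, so a stronger judge (for example, a second model or human spot checks) would likely be needed.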
Taxonomy
Topics: Access Control and Trust · Data Quality and Management · Software System Performance and Reliability
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Softmax · Attention Dropout · WordPiece · Linear Layer · Residual Connection · Byte Pair Encoding · Weight Decay
