Know Or Not: a library for evaluating out-of-knowledge base robustness

Jessica Foo; Pradyumna Shyama Prasad; Shaun Khoo

arXiv:2505.13545·cs.IR·July 22, 2025

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper introduces 'knowornot', an open-source library for evaluating the robustness of large language models in high-stakes scenarios, specifically their ability to abstain from answering questions outside their knowledge base in retrieval-augmented generation settings.

Contribution

It presents a novel systematic methodology and a flexible, extensible library for assessing out-of-knowledge base robustness without manual annotations.

Findings

01

Developed a comprehensive benchmark, PolicyBench, for government policy QA chatbots.

02

Demonstrated the utility of knowornot in evaluating LLMs' OOKB robustness.

03

Provided a modular, reproducible framework for customized robustness evaluation.

Abstract

While the capabilities of large language models (LLMs) have progressed significantly, their use in high-stakes applications has been limited due to risks of hallucination. One key approach to reducing hallucination is retrieval-augmented generation (RAG), but even in such setups, LLMs may still hallucinate when presented with questions outside of the knowledge base. Such behavior is unacceptable in high-stakes applications where LLMs are expected to abstain from answering queries for which they do not have sufficient context. In this work, we present a novel methodology for systematically evaluating out-of-knowledge base (OOKB) robustness of LLMs (whether LLMs know or do not know) in the RAG setting, without the need for manual annotation of gold standard answers. We implement our methodology in knowornot, an open-source library that enables users to develop their own customized evaluation…

Code & Models

Repositories

govtech-responsibleai/knowornot
Official

Datasets

govtech/PolicyBench
dataset · 24 dl

Taxonomy

Topics: Access Control and Trust · Data Quality and Management · Software System Performance and Reliability

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Softmax · Attention Dropout · WordPiece · Linear Layer · Residual Connection · Byte Pair Encoding · Weight Decay