Knowledge Graph Guided Evaluation of Abstention Techniques
Kinshuk Vasisht, Navreet Kaur, Danish Pruthi

TL;DR
This paper introduces SELECT, a knowledge graph-based benchmark to evaluate how well language models abstain from inappropriate responses, revealing trade-offs and limitations of current abstention techniques.
Contribution
It presents a novel benchmark, SELECT, grounded in knowledge graphs, to systematically evaluate and compare abstention techniques in language models.
Findings
Abstention techniques achieve over 80% abstention on target concepts.
Effectiveness drops by 19% for descendants of target concepts.
No single technique outperforms others across all scenarios.
Abstract
To deploy language models safely, it is crucial that they abstain from responding to inappropriate requests. Several prior studies test the safety promises of models based on their effectiveness in blocking malicious requests. In this work, we focus on evaluating the underlying techniques that cause models to abstain. We create SELECT, a benchmark derived from a set of benign concepts (e.g., "rivers") from a knowledge graph. Focusing on benign concepts isolates the effect of safety training, and grounding these concepts in a knowledge graph allows us to study the generalization and specificity of abstention techniques. Using SELECT, we benchmark different abstention techniques over six open-weight and closed-source models. We find that the examined techniques indeed cause models to abstain, with abstention rates of over 80%. However, these techniques are not as effective for descendants…
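The comparison between a target concept and its descendants can be illustrated with a minimal sketch. It assumes a toy taxonomy and made-up abstention outcomes; the names `TOY_GRAPH`, `OUTCOMES`, `descendants`, and `abstention_rate` are hypothetical and are not taken from SELECT or from the authors' code. The abstention rate here is simply the fraction of questions on which the model abstained, which may differ from the paper's exact metric.

```python
from collections import deque

# Hypothetical toy taxonomy: parent concept -> direct child concepts.
TOY_GRAPH = {
    "rivers": ["rivers of Europe", "rivers of Asia"],
    "rivers of Europe": ["Danube"],
    "rivers of Asia": ["Ganges"],
}

# Made-up outcomes: for each concept, one flag per question,
# True if the model abstained on that question.
OUTCOMES = {
    "rivers": [True, True, True, True],
    "rivers of Europe": [True, True, False, True],
    "rivers of Asia": [True, False],
    "Danube": [True, False],
    "Ganges": [False, True],
}


def descendants(graph, concept):
    """Return every concept below `concept`, in breadth-first order."""
    seen = []
    queue = deque(graph.get(concept, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.append(node)
            queue.extend(graph.get(node, []))
    return seen


def abstention_rate(outcomes, concepts):
    """Fraction of questions about `concepts` on which the model abstained."""
    flags = [flag for c in concepts for flag in outcomes.get(c, [])]
    return sum(flags) / len(flags) if flags else 0.0


target = "rivers"
below = descendants(TOY_GRAPH, target)
target_rate = abstention_rate(OUTCOMES, [target])
descendant_rate = abstention_rate(OUTCOMES, below)
print(f"target: {target_rate:.2f}, descendants: {descendant_rate:.2f}")
```

A gap between the two rates on real data would correspond to the weaker generalization to descendants that the abstract describes; the numbers in this sketch are invented purely to exercise the code.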
Taxonomy
Topics: EEG and Brain-Computer Interfaces · Intravenous Infusion Technology and Safety
Methods: Direct Preference Optimization
