Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations

Swapnaja Achintalwar; Adriana Alvarado Garcia; Ateret Anaby-Tavor; Ioana Baldini; Sara E. Berger; Bishwaranjan Bhattacharjee; Djallel Bouneffouf; Subhajit Chaudhury; Pin-Yu Chen; Lamogha Chiazor; Elizabeth M. Daly; Kirushikesh DB; Rogério Abreu de Paula; Pierre Dognin; Eitan Farchi; Soumya Ghosh; Michael Hind; Raya Horesh; George Kour; Ja Young Lee; Nishtha Madaan; Sameep Mehta; Erik Miehling; Keerthiram Murugesan; Manish Nagireddy; Inkit Padhi; David Piorkowski; Ambrish Rawat; Orna Raz; Prasanna Sattigeri; Hendrik Strobelt; Sarathkrishna Swaminathan; Christoph Tillmann; Aashka Trivedi; Kush R. Varshney; Dennis Wei; Shalisha Witherspoon; Marcel Zalmanovici

arXiv:2403.06009·cs.LG·August 20, 2024·2 cites


Open Access 5 Models

TL;DR

This paper discusses the development and deployment of compact detectors for identifying risks in large language models, highlighting their uses, challenges, and future directions for improving AI safety and governance.

Contribution

It introduces a library of simple, effective classifiers for detecting various harms in LLMs and explores their applications and limitations.

Findings

1. Detectors can identify biases, toxicity, and non-faithful outputs.
2. They serve as safety guardrails and support AI governance.
3. Challenges include reliability and scope expansion.

Abstract

Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. Due to several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. Therefore, an efficient and reliable alternative is required. To this end, we present our ongoing efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms. In addition to the detectors themselves, we discuss a wide range of uses for these detector models - from acting as guardrails to enabling effective AI governance. We also deep dive into inherent challenges in their development and discuss future work aimed at making the detectors more reliable and broadening their scope.
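The guardrail use mentioned in the abstract can be pictured with a minimal sketch. This is not the authors' library or API: the `guard` function, the `Detector` type, the threshold, and the toy keyword scorer are all hypothetical stand-ins, with the keyword scorer taking the place of a trained compact classifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical detector interface: maps text to a harm score in [0, 1].
Detector = Callable[[str], float]


def toy_toxicity_detector(text: str) -> float:
    """Stand-in for a trained classifier: share of words in a tiny blocklist."""
    blocklist = {"idiot", "stupid"}  # illustrative only, not from the paper
    words = [w.strip(".,!?").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in blocklist for w in words) / len(words)


@dataclass
class GuardrailResult:
    text: str                 # text that is safe to show downstream
    flagged: bool             # True if any detector met the threshold
    scores: Dict[str, float]  # per-detector scores, useful for audit logs


def guard(
    output: str,
    detectors: Dict[str, Detector],
    threshold: float = 0.2,   # assumed cut-off; a real system would tune this
    fallback: str = "[response withheld]",
) -> GuardrailResult:
    """Score an LLM output with every detector and withhold it if any fires."""
    scores = {name: detect(output) for name, detect in detectors.items()}
    flagged = any(score >= threshold for score in scores.values())
    return GuardrailResult(fallback if flagged else output, flagged, scores)


detectors = {"toxicity": toy_toxicity_detector}
blocked = guard("You are a stupid idiot", detectors)
allowed = guard("Have a nice day", detectors)
```

Because the detectors run on the model's output rather than inside the model, a wrapper of this shape needs no access to weights or training data, which is the constraint the abstract describes. Keeping the per-detector scores alongside the decision is one way such a wrapper could also feed governance or audit tooling.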


Taxonomy

Topics: Particle Detector Development and Performance

Methods: Lib