Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning
Xiaotian Zhou, Di Tang, Xiaofeng Wang, Xiaozhong Liu

TL;DR
This paper presents GMRL-BD, an algorithm that uses multi-agent reinforcement learning over a Wikipedia-derived knowledge graph to identify, within a limited number of queries, the topics on which a black-box LLM is likely to produce biased or untrustworthy responses.
Contribution
The authors introduce GMRL-BD, a method that combines a knowledge graph with multi-agent reinforcement learning to efficiently detect the untrustworthy boundary of an LLM, that is, the set of topics on which its answers should not be trusted.
Findings
Efficient detection of untrustworthy LLM topics with limited queries.
Demonstrated effectiveness across multiple popular LLMs.
Released a dataset with bias labels for various LLMs.
Abstract
Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologized, or incorrect responses, which limits their applications when there is no clear understanding of the topics on which their answers can be trusted. In this research, we introduce a novel algorithm, named GMRL-BD, designed to identify the untrustworthy boundaries (in terms of topics) of a given LLM, with only black-box access to the LLM and under specific query constraints. Based on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm incorporates multiple reinforcement learning agents to efficiently identify topics (nodes in the KG) where the LLM is likely to generate biased answers. Our experiments demonstrated the efficiency of our algorithm, which can detect the untrustworthy boundary with just limited queries…
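The general idea described in the abstract (several agents walking a topic graph and spending a fixed query budget where biased answers seem most likely) can be illustrated with a small toy sketch. This is not the authors' GMRL-BD implementation and involves no learned policy: it is a simplified epsilon-greedy heuristic under stated assumptions. The graph TOY_KG, the oracle toy_bias_oracle (a stand-in for "query the LLM and judge whether the answer is biased"), and the function explore are all hypothetical names introduced only for illustration.

```python
import random
from typing import Callable, Dict, List

# Hypothetical toy topic graph; stands in for a Wikipedia-derived KG.
# Assumption: every node has at least one neighbor.
TOY_KG: Dict[str, List[str]] = {
    "science": ["physics", "medicine"],
    "physics": ["science", "energy"],
    "medicine": ["science", "vaccines"],
    "energy": ["physics", "politics"],
    "vaccines": ["medicine", "politics"],
    "politics": ["energy", "vaccines", "elections"],
    "elections": ["politics"],
}


def toy_bias_oracle(topic: str) -> bool:
    """Hypothetical stand-in for querying a black-box LLM on a topic
    and judging whether its answer is biased."""
    return topic in {"politics", "elections", "vaccines"}


def explore(
    kg: Dict[str, List[str]],
    is_biased: Callable[[str], bool],
    starts: List[str],
    budget: int,
    epsilon: float = 0.2,
    seed: int = 0,
    max_steps: int = 1000,
) -> Dict[str, bool]:
    """Budget-limited, multi-agent epsilon-greedy exploration of a topic graph.

    Each agent labels the node it stands on (one oracle query per new node),
    then moves to a neighbor. All agents share the same label table.
    """
    rng = random.Random(seed)
    labels: Dict[str, bool] = {}  # shared results: topic -> judged biased?
    positions = list(starts)      # one current node per agent

    def score(node: str) -> int:
        # Already-queried nodes are least attractive; otherwise prefer
        # nodes with more neighbors already known to be biased.
        if node in labels:
            return -1
        return sum(1 for n in kg[node] if labels.get(n, False))

    for _ in range(max_steps):
        # Stop when the budget is spent or the whole graph is labeled.
        if len(labels) >= budget or len(labels) == len(kg):
            break
        for i, pos in enumerate(positions):
            if pos not in labels:
                if len(labels) >= budget:
                    break
                labels[pos] = is_biased(pos)  # costs one query
            if rng.random() < epsilon:
                positions[i] = rng.choice(kg[pos])       # explore
            else:
                positions[i] = max(kg[pos], key=score)   # exploit
    return labels


# Example run: two agents, at most five queries.
result = explore(TOY_KG, toy_bias_oracle, ["science", "energy"], budget=5)
print(sorted(topic for topic, biased in result.items() if biased))
```

In this sketch the number of oracle calls equals the number of labeled nodes, so the budget is never exceeded. A real system would replace the oracle with actual LLM queries plus a bias judgment, and the hand-written score with whatever policy the agents learn.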
