HRIPBench: Benchmarking LLMs in Harm Reduction Information Provision to Support People Who Use Drugs
Kaixuan Wang, Chenxin Diao, Jason T. Jacques, Zhongliang Guo, and Shuai Zhao

TL;DR
This paper introduces HRIPBench, a benchmark for evaluating large language models' accuracy and safety in providing harm reduction information to people who use drugs, revealing current models' limitations and safety concerns.
Contribution
The paper presents HRIPBench, which comprises a novel benchmark dataset and evaluation schemes for assessing LLMs' performance in harm reduction contexts, and highlights their current shortcomings and safety risks.
Findings
State-of-the-art LLMs struggle with accuracy in harm reduction tasks.
LLMs sometimes pose severe safety risks to people who use drugs.
Caution is advised in deploying LLMs for harm reduction to prevent negative health outcomes.
Abstract
The well-being of millions of individuals is challenged by the harms of substance use. Harm reduction as a public health strategy is designed to improve their health outcomes and reduce safety risks. Some large language models (LLMs) have demonstrated a decent level of medical knowledge, promising to address the information needs of people who use drugs (PWUD). However, their performance in relevant tasks remains largely unexplored. We introduce HRIPBench, a benchmark designed to evaluate LLMs' accuracy and safety risks in harm reduction information provision. The benchmark dataset HRIP-Basic has 2,160 question-answer-evidence pairs. The scope covers three tasks: checking safety boundaries, providing quantitative values, and inferring polysubstance use risks. We build the Instruction and RAG schemes to evaluate model behaviours based on their inherent knowledge and the integration of domain…
