HRIPBench: Benchmarking LLMs in Harm Reduction Information Provision to Support People Who Use Drugs

Kaixuan Wang; Chenxin Diao; Jason T. Jacques; Zhongliang Guo; and Shuai Zhao

arXiv:2507.21815·cs.CL·July 30, 2025


TL;DR

This paper introduces HRIPBench, a benchmark for evaluating large language models' accuracy and safety in providing harm reduction information to people who use drugs, revealing current models' limitations and safety concerns.

Contribution

The paper presents HRIPBench, a novel benchmark dataset and evaluation schemes to assess LLMs' performance in harm reduction contexts, highlighting their current shortcomings and safety risks.

Findings

01 State-of-the-art LLMs struggle with accuracy in harm reduction tasks.

02 LLMs sometimes pose severe safety risks to people who use drugs.

03 Caution is advised in deploying LLMs for harm reduction to prevent negative health outcomes.

Abstract

The well-being of millions of individuals is challenged by the harms of substance use. Harm reduction is a public health strategy designed to improve their health outcomes and reduce safety risks. Some large language models (LLMs) have demonstrated a decent level of medical knowledge and could help address the information needs of people who use drugs (PWUD). However, their performance on relevant tasks remains largely unexplored. We introduce HRIPBench, a benchmark designed to evaluate LLMs' accuracy and safety risks in harm reduction information provision. The benchmark dataset, HRIP-Basic, has 2,160 question-answer-evidence pairs. Its scope covers three tasks: checking safety boundaries, providing quantitative values, and inferring polysubstance use risks. We build the Instruction and RAG schemes to evaluate model behaviours based on their inherent knowledge and the integration of domain…
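To make the difference between the two evaluation schemes concrete, the following is a minimal sketch of how a question-answer-evidence record might be turned into an Instruction-style prompt (model knowledge only) and a RAG-style prompt (evidence supplied as context). Everything here is an assumption for illustration: the `QARecord` fields, the prompt wording, and the exact-match scoring are hypothetical and are not taken from the paper or from the released dataset's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QARecord:
    """Hypothetical record layout; the real dataset's columns may differ."""
    question: str
    answer: str
    evidence: str


def instruction_prompt(record: QARecord) -> str:
    # Instruction-style: the model must rely on its own knowledge.
    return f"Answer the question concisely.\nQuestion: {record.question}\nAnswer:"


def rag_prompt(record: QARecord) -> str:
    # RAG-style: the evidence passage is supplied as context.
    return (
        "Answer the question using only the context.\n"
        f"Context: {record.evidence}\n"
        f"Question: {record.question}\nAnswer:"
    )


def exact_match_accuracy(predictions: list[str], records: list[QARecord]) -> float:
    # Assumed metric: case-insensitive exact match against the gold answer.
    if len(predictions) != len(records):
        raise ValueError("predictions and records must have the same length")
    if not records:
        return 0.0
    hits = sum(
        pred.strip().lower() == rec.answer.strip().lower()
        for pred, rec in zip(predictions, records)
    )
    return hits / len(records)
```

In such a setup, the same records would be scored once per scheme, so that any gap between the two accuracies reflects what the supplied evidence contributes beyond the model's own knowledge.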


Code & Models

Datasets

RayK/Harm_Reduction_QA_dataset_basic
dataset · 14 downloads
