HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

Euntae Kim, Soomin Han, and Buru Chang

arXiv:2604.19274 · cs.CL · April 22, 2026

TL;DR

HarDBench is a new benchmark designed to evaluate and improve the safety of large language models in collaborative writing, focusing on their vulnerability to harmful content generation in draft-based co-authoring scenarios.

Contribution

The paper introduces HarDBench, a systematic benchmark for assessing LLM robustness against draft-based jailbreak attacks, and proposes a safety-utility balanced alignment method.

Findings

01. Existing LLMs are highly vulnerable to draft-based jailbreak attacks.

02. The proposed alignment method significantly reduces harmful outputs.

03. The benchmark effectively evaluates LLM safety in high-risk domains.

Abstract

Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models, filling incomplete drafts with dangerous content, to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains (including Explosives, Drugs, Weapons, and Cyberattacks) and features prompts with realistic structure and domain-specific cues to assess model susceptibility to harmful completions. To mitigate this risk, we introduce a…

Code & Models

Repositories

untae0122/HarDBench
github

Datasets

untae/HarDBench
dataset · 20 dl
