HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
Euntae Kim, Soomin Han, and Buru Chang

TL;DR
HarDBench is a new benchmark designed to evaluate and improve the safety of large language models in collaborative writing, focusing on their vulnerability to harmful content generation in draft-based co-authoring scenarios.
Contribution
The paper introduces HarDBench, a systematic benchmark for assessing LLM robustness against draft-based jailbreak attacks, and proposes a safety-utility balanced alignment method.
Findings
Existing LLMs are highly vulnerable to draft-based jailbreak attacks.
The proposed alignment method significantly reduces harmful outputs.
The benchmark effectively evaluates LLM safety in high-risk domains.
Abstract
Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models, filling incomplete drafts with dangerous content, to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains, including Explosives, Drugs, Weapons, and Cyberattacks, and features prompts with realistic structure and domain-specific cues to assess model susceptibility to harmful completions. To mitigate this risk, we introduce a…
