BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum

Yubin Kim; Zhiyuan Hu; Hyewon Jeong; Eugene Park; Shuyue Stella Li; Chanwoo Park; Shiyun Xiong; MingYu Lu; Hyeonhoon Lee; Xin Liu; Daniel McDuff; Cynthia Breazeal; Samir Tulebaev; Hae Won Park

arXiv:2505.21757 · cs.CL · May 29, 2025

TL;DR

This paper introduces BehaviorSFT, a training method that conditions clinical language models on behavioral tokens so they can select between reactive and proactive assistance. It is evaluated on a new dataset, BehaviorBench, and through clinician review.

Contribution

The paper presents BehaviorSFT, a novel behavioral token conditioning strategy that enhances LLMs' proactive engagement in clinical tasks, addressing a key limitation in existing models.
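To make the idea of behavioral token conditioning concrete, here is a minimal sketch of how a supervised fine-tuning example might be tagged with a behavior token. This is an illustration under assumptions, not the authors' implementation: the token strings `<reactive>` and `<proactive>`, the helper names `format_example` and `split_behavior`, and the choice to place the token at the start of the target are all hypothetical.

```python
# Hypothetical behavior tokens; the paper's actual token vocabulary is not given here.
BEHAVIOR_TOKENS = {
    "reactive": "<reactive>",
    "proactive": "<proactive>",
}


def format_example(prompt: str, behavior: str, response: str) -> dict:
    """Build one SFT example whose target begins with a behavior token.

    Assumption: the model learns to emit the token first, then the response,
    so the behavior choice is made explicitly before the answer is generated.
    """
    if behavior not in BEHAVIOR_TOKENS:
        raise ValueError(f"unknown behavior: {behavior!r}")
    token = BEHAVIOR_TOKENS[behavior]
    return {"input": prompt, "target": f"{token} {response}"}


def split_behavior(generated: str) -> tuple:
    """Separate a leading behavior token from generated text.

    Returns (behavior, text); behavior is None if no known token leads the text.
    """
    for behavior, token in BEHAVIOR_TOKENS.items():
        if generated.startswith(token):
            return behavior, generated[len(token):].strip()
    return None, generated


example = format_example(
    prompt="Patient note: chest pain, no medication history recorded.",
    behavior="proactive",
    response="Medication history is missing; please confirm current medications.",
)
print(example["target"])
```

In a real training pipeline the tokens would typically also be registered with the tokenizer as special tokens so each maps to a single ID; that step is omitted here because it depends on the library in use.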

Findings

01 BehaviorSFT achieves up to 97.3% Macro F1 on BehaviorBench.

02 Proactive task scores improved from 95.0% to 96.5%.

03 Clinician evaluations show more realistic and balanced clinical behavior.
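For readers unfamiliar with the metric, Macro F1 is the unweighted mean of per-class F1 scores, so rare classes count as much as common ones. The sketch below computes it from scratch in plain Python; the label names and toy predictions are made up for illustration and are not taken from BehaviorBench.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with hypothetical labels (not BehaviorBench data).
y_true = ["reactive", "reactive", "proactive", "proactive"]
y_pred = ["reactive", "proactive", "proactive", "proactive"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
```

Here the "reactive" class has F1 = 2/3 and the "proactive" class has F1 = 0.8, so the macro average is about 0.7333.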

Abstract

Large Language Models (LLMs) as clinical agents require careful behavioral adaptation. While adept at reactive tasks (e.g., diagnosis reasoning), LLMs often struggle with proactive engagement, like unprompted identification of critical missing information or risks. We introduce BehaviorBench, a comprehensive dataset to evaluate agent behaviors across a clinical assistance spectrum, ranging from reactive query responses to proactive interventions (e.g., clarifying ambiguities, flagging overlooked critical data). Our BehaviorBench experiments reveal LLMs' inconsistent proactivity. To address this, we propose BehaviorSFT, a novel training strategy using behavioral tokens to explicitly condition LLMs for dynamic behavioral selection along this spectrum. BehaviorSFT boosts performance, achieving up to 97.3% overall Macro F1 on BehaviorBench and improving proactive task scores (e.g., from…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling