Reinforcing Human Behavior Simulation via Verbal Feedback

Weiwei Sun; Xuhui Zhou; Jiarui Liu; Weihua Du; Haojia Sun; Yiqing Xie; Qianou Ma; Sihao Chen; Mengting Wan; Longqi Yang; Pei Zhou; Sherry Wu; Sean Welleck; Graham Neubig; Yiming Yang; Maarten Sap

arXiv:2605.20506 · cs.LG · May 21, 2026

TL;DR

This paper introduces DITTO, a model trained with reinforcement learning on verbal feedback so that LLMs simulate human social behavior more faithfully across diverse tasks.

Contribution

The paper presents DITTO, an RL method that uses verbal feedback as a primary training signal. It also introduces SOUL, a comprehensive benchmark for human-like behavior simulation.

Findings

1. DITTO achieves a 36% improvement over the base model.
2. DITTO exceeds GPT-5.4 on 6 out of 10 SOUL benchmarks.
3. The results point to RL with verbal feedback as a promising training direction.

Abstract

Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying "that was rude" or a friend explaining "here's why that hurt"). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and condensed into scalar values. As LLMs are increasingly used to simulate human behavior, e.g., standing in for users, patients, students, and other personas, there is a pressing need to make them more human-like, which requires embracing a fundamentally different kind of signal: feedback that is verbal, subjective, and multi-faceted. We present DITTO, a model trained by treating verbal feedback as a first-class signal in reinforcement learning. After each rollout, DITTO receives verbal feedback and generates a feedback-conditioned improved rollout; both outputs are jointly optimized with GRPO, distilling verbal…
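
The optimization step mentioned in the abstract can be illustrated with a small sketch. GRPO scores each rollout relative to the other rollouts sampled for the same prompt by standardizing rewards within that group. The Python below shows only this normalization, under two assumptions that the text above does not confirm: that the initial and feedback-conditioned rollouts are pooled into a single group, and that each rollout already has a scalar score. The function name and all reward values are made up for illustration; this is not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its group.

    `eps` avoids division by zero when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for one prompt (illustrative numbers only).
initial_rewards = [0.2, 0.4]   # rollouts produced before any feedback
revised_rewards = [0.6, 0.8]   # rollouts regenerated after verbal feedback

# Assumption: both kinds of rollout share one group, so a revised rollout
# that scores higher than its peers receives a positive advantage.
pooled = initial_rewards + revised_rewards
advantages = group_relative_advantages(pooled)

for reward, adv in zip(pooled, advantages):
    print(f"reward={reward:.1f} advantage={adv:+.3f}")
```

Under this pooling assumption, rollouts that score above the group mean are reinforced and those below it are discouraged, which is one plausible way the effect of feedback could be distilled into the policy.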

Code & Models

Models

sunweiwei/Ditto-8B
model · 86 dl

Datasets

sunweiwei/Soul
dataset · 41 dl
