Moral Mazes in the Era of LLMs
Dang Nguyen, Harvey Yiyun Fu, Peter West, Ari Holtzman, Chenhao Tan

TL;DR
This study examines how well large language models (LLMs) navigate complex workplace social norms, using a simulated email-writing task. It finds systematic differences between LLM and human emails and points to the potential of LLMs to reshape professional communication.
Contribution
The paper introduces HR Simulator, a game-based benchmark for evaluating LLMs in workplace social scenarios, and compares LLM and human emails in terms of communication style, performance, and emergent norms.
Findings
LLM emails are more formal and empathetic than human emails.
Humans underperform LLMs on scenario pass rate (e.g., 23.5% vs. 48-54%).
Human emails rewritten by LLMs can outperform both human-written and LLM-written emails.
Abstract
Navigating complex social situations is an integral part of corporate life, ranging from giving critical feedback without hurting morale to rejecting requests without alienating teammates. Although large language models (LLMs) are permeating the workplace, it is unclear how well they can navigate these norms. To investigate this question, we created HR Simulator, a game where users roleplay as an HR officer and write emails to tackle challenging workplace scenarios, evaluated with GPT-4o as a judge based on scenario-specific rubrics. We analyze over 600 human and LLM emails and find systematic differences in style: LLM emails are more formal and empathetic. Furthermore, humans underperform LLMs (e.g., 23.5% vs. 48-54% scenario pass rate), but human emails rewritten by LLMs can outperform both, which indicates a hybrid advantage. On the evaluation side, judges can exhibit differences in…
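The scenario pass rate cited in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that a judge returns one true/false verdict per rubric criterion and that an email passes only when every criterion is met; the excerpt above does not state the actual pass rule. The names `JudgedEmail`, `passes`, and `pass_rate`, and the toy verdicts, are invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class JudgedEmail:
    """One email plus hypothetical per-criterion verdicts from a judge."""
    scenario: str
    author: str  # e.g. "human" or "llm"
    rubric_results: dict[str, bool]


def passes(email: JudgedEmail) -> bool:
    # Assumed rule: pass only if the rubric is non-empty and every criterion is met.
    return bool(email.rubric_results) and all(email.rubric_results.values())


def pass_rate(emails: list[JudgedEmail], author: str) -> float:
    """Fraction of the given author's emails that pass their scenario rubric."""
    subset = [e for e in emails if e.author == author]
    if not subset:
        return 0.0
    return sum(passes(e) for e in subset) / len(subset)


# Toy verdicts for illustration only; these are not results from the paper.
emails = [
    JudgedEmail("reject_request", "human", {"clear_decision": True, "keeps_rapport": False}),
    JudgedEmail("critical_feedback", "human", {"specific": True, "protects_morale": True}),
    JudgedEmail("reject_request", "llm", {"clear_decision": True, "keeps_rapport": True}),
    JudgedEmail("critical_feedback", "llm", {"specific": True, "protects_morale": True}),
]

human_rate = pass_rate(emails, "human")
llm_rate = pass_rate(emails, "llm")
```

Requiring all criteria to hold is the strictest aggregation; a threshold on the share of criteria met would be an equally plausible alternative under the same assumptions.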
