Sotopia-RL: Reward Design for Social Intelligence
Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan, Bodhisattwa Prasad Majumder, Hao Zhu, Paul Pu Liang, Jiaxuan You

TL;DR
This paper introduces Sotopia-RL, a reinforcement learning framework that designs multi-dimensional, utterance-level rewards to improve social intelligence in language models, leading to state-of-the-art social goal completion.
Contribution
It presents a novel reward design framework that refines coarse feedback into detailed, multi-dimensional rewards for social tasks, enhancing RL training effectiveness.
Findings
Achieves state-of-the-art social goal completion scores on the Sotopia benchmarks (reviewers cite 7.81 on Sotopia-hard and 8.57 on Sotopia-all).
Demonstrates the importance of utterance-level credit assignment.
Shows that multi-dimensional rewards reduce reward hacking.
Abstract
Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as collaboration and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions without requiring human annotations. However, there are two unique parts about social intelligence tasks: (1) the quality of individual utterances in social interactions is not strictly related to final success; (2) social interactions require multi-dimensional rubrics for success. Therefore, we argue that it is necessary to design rewards for building utterance-level multi-dimensional reward models to facilitate RL training for social intelligence tasks. To address these challenges, we propose Sotopia-RL, a novel…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Well-Motivated and Novel Problem Formulation: The paper compellingly argues that social intelligence tasks are fundamentally different from math or coding, making standard RL reward signals ineffective. The identification of "weak correlation" and "multi-dimensionality" as core challenges is precise and well-justified.
Clear and Practical Methodology: The proposed two-stage pipeline (offline reward collection, online RL training) is clearly explained and seems practically implementable. The dist
Baseline Clarification and Comparison: The definition and implementation of some baselines (e.g., PPDPP, EPO, DAT, DSI in Table 1) are not sufficiently detailed in the main text, requiring the reader to hunt through citations. A brief summary of how these methods work and why they are relevant comparators would improve clarity. Furthermore, a comparison to simpler fine-tuning methods like Direct Preference Optimization (DPO) on the same data would have been a valuable baseline.
Statistical Repor
- They show superior performance on social intelligence benchmarks, scoring 7.81 on SOTOPIA-hard and 8.57 on the SOTOPIA-all dataset.
- They provide a novel reward structure of Utterance-Level Credit Assignment and multi-dimensional rewards.
- They show lack of reward hacking and robustness against overfitting.
There are several weaknesses that I would encourage the authors to address:
- Limited human evaluation: You conduct only a small-scale human annotation study with 4 annotators (as noted in your Appendix), and your main evaluation relies on LLM-based automatic evaluators (GPT-4o), which may not capture the full range of human social judgments. The number of human annotators and where they were recruited from should be in the main text. You should also note whether the study had IRB approval.
- E
1. The paper tackles a meaningful and underexplored challenge, which is training agents for socially grounded interaction rather than factual or reasoning tasks.
2. The two-stage pipeline (offline LLM-based attribution and online GRPO optimization) is simple, reproducible, and builds upon recent trends in process and preference-based reward modeling.
3. Results on both Sotopia-hard and Sotopia-all benchmarks show consistent improvements, with well-designed ablations that isolate contributions
1. While the idea of converting coarse feedback into fine-grained, multi-dimensional rewards is meaningful in social RL, it is not fundamentally new in the broader LLM-RL landscape. The design closely parallels Process Reward Modeling (PRM) and other recent works on token- or step-level credit assignment in reasoning and coding domains. The novelty here mainly lies in applying such techniques to social environments rather than introducing a new RL principle.
2. The method's heavy reliance on GP
1. The paper proposes a feasible reinforcement learning framework for social tasks that improves model performance on the specified tasks.
2. The paper uses multi-dimensional rewards, which improve the robustness and density of the reward signal.
3. The paper uses a credit assignment mechanism that effectively allocates the overall reward to individual utterances.
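To make the credit assignment and multi-dimensional reward ideas mentioned in these reviews concrete, here is a minimal sketch. It is not the authors' implementation: the function name `utterance_rewards`, the dimension names (`goal`, `relationship`), the dimension weights, and the proportional-split rule are all illustrative assumptions. In the paper the per-utterance attribution comes from an offline LLM-based step; here the attribution scores are simply passed in as inputs.

```python
from typing import Dict, List


def utterance_rewards(
    episode_scores: Dict[str, float],
    attributions: Dict[str, List[float]],
    dim_weights: Dict[str, float],
) -> List[float]:
    """Split episode-level scores across utterances, then combine dimensions.

    episode_scores: one episode-level score per reward dimension (hypothetical names).
    attributions:   non-negative per-utterance attribution scores per dimension.
    dim_weights:    weight of each dimension in the final combined reward.
    """
    n = len(next(iter(attributions.values())))
    combined = [0.0] * n
    for dim, score in episode_scores.items():
        attr = attributions[dim]
        total = sum(attr)
        # Assumption: with no attribution signal, split the score uniformly.
        shares = [a / total for a in attr] if total > 0 else [1.0 / n] * n
        for i, share in enumerate(shares):
            # Each utterance receives its share of this dimension's score.
            combined[i] += dim_weights[dim] * score * share
    return combined


rewards = utterance_rewards(
    episode_scores={"goal": 8.0, "relationship": 4.0},
    attributions={"goal": [1.0, 3.0], "relationship": [1.0, 1.0]},
    dim_weights={"goal": 0.5, "relationship": 0.5},
)
print(rewards)  # [2.0, 4.0]
```

Under this splitting rule the per-utterance rewards sum to the weighted episode-level score (here 0.5 * 8.0 + 0.5 * 4.0 = 6.0), so the dense signal stays consistent with the coarse feedback it was derived from.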
1. The paper does not confirm that the dataset used for GRPO is disjoint from the evaluation dataset, which may make the results unfair.
2. The motivation for using online reinforcement learning is unclear. The main difficulty of the work is training an effective reward model, yet this reward model is trained on offline data, so it may overfit when used in online reinforcement learning. The reward model training epoch published in the paper is 60,
