Adaptive Social Learning via Mode Policy Optimization for Language Agents
Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, Wenji Mao

TL;DR
This paper introduces an adaptive social learning framework for language agents that dynamically adjusts reasoning depth based on context, improving social interaction performance and token efficiency.
Contribution
It proposes a set of hierarchical reasoning modes and an adaptive mode policy optimization algorithm for context-aware, token-efficient social reasoning in language agents.
Findings
Achieves 15.6% higher task performance than GPT-4o.
Outperforms GRPO by 7.0% in social tasks.
Reduces reasoning chain length by 32.8%.
Abstract
Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current studies. Existing methods either lack explicit reasoning or employ lengthy Chain-of-Thought reasoning uniformly across all scenarios, resulting in excessive token usage and inflexible social behaviors in tasks such as negotiation or collaboration. To address this, we propose an Adaptive Social Learning (ASL) framework in this paper, aiming to improve the adaptive reasoning ability of language agents in dynamic social interactions. To this end, we first identify the hierarchical reasoning modes under such context, ranging from intuitive response to deep deliberation based on the cognitive control theory. We then develop the Adaptive Mode Policy Optimization…
Peer Reviews
Decision: ICLR 2026 Poster
1. Novel conceptual framing. Mapping hierarchical cognitive control into a small set of explicit, controllable reasoning modes and operationalizing them via control tokens is an original and compelling idea for resource-aware, controllable LLM behavior in social settings. 2. Practical algorithmic contribution (AMPO). The combined mode-level and sample-level advantage estimation in AMPO is a thoughtful improvement over mode-agnostic RL approaches; the technique is intuitive and appears effective
(A) Reproducibility gaps (major). Key elements required for exact reproduction are not fully specified in the paper: 1. The exact BC prompt templates and a set of representative prompt-to-response examples for each mode are not included. 2. The full evaluator/judge prompt used to compute the reward (and any LLM scoring parameters such as temperature, max_tokens, and whether CoT is used) is not fully published. 3. Precise hyperparameters for BC and AMPO (learning rates, batch sizes, number of per-state ro
The paper’s main novelty is explicit, discrete reasoning modes for social interaction, plus dual-level advantages so the policy learns when to use shallow vs. deep reasoning. The proposed AMPO is an insightful combination of psychological knowledge and computational agent training. The algorithm and experiments are clearly presented in the paper. Abundant experiments showcase the improvement of the efficiency of the agent's reasoning process. This work is practically meaningful.
1. The human evaluators only pick the winner between two outputs and do not participate in the reward design process, so the results may still suffer from design/bias variance and proprietary drift. 2. The large format penalty and length penalty may bias the policy toward short or over-structured outputs, potentially suppressing socially nuanced behaviors. 3. The efficiency-motivated reduction to single-turn RL could miss the multi-turn dependencies typical of social dialogue.
1. The work is of good technical quality. The 3-step ASL framework (mode design, BC, AMPO) is logical and well-executed. The reasoning modes are grounded in cognitive science (HCCT). The design of AMPO is intuitive and well-justified. The experimental validation is exceptionally thorough. 2. The paper is well-written. The problem statement is clear and the proposed solution is easy to follow. 3. The novelty of the paper's method is significant. It demonstrates significant improvement on model so
1. Strong dependence on hand-crafted reasoning modes. 2. Scalability of AMPO with more modes: AMPO is demonstrated with $N=4$ modes. However, the calculation of the mode-level advantage $A^{\mathcal{M}}$ requires gathering sufficient samples for **each mode in each output group** to compute stable statistics (average reward and length). As the number of modes $N$ increases, this approach could suffer from sample inefficiency, making it difficult to scale. 3. Minor advice on notation: the mode no
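The scalability concern above can be made concrete with a small sketch. The summary only states that AMPO combines mode-level and sample-level advantages and that the mode-level term needs per-mode average reward and length within each output group; the exact formulas are not given here. The code below is therefore an assumption-laden illustration: it uses plain mean/standard-deviation standardization at both levels, and the function name `group_statistics`, the dictionary keys, and the `eps` constant are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_statistics(samples, eps=1e-6):
    """Illustrative dual-level statistics for one output group.

    `samples` is a list of dicts with hypothetical keys:
    "mode" (int), "reward" (float), "length" (int, in tokens).
    This is NOT the paper's formula, only a standardization sketch.
    """
    # Bucket the rollouts of this group by the reasoning mode they used.
    by_mode = defaultdict(list)
    for s in samples:
        by_mode[s["mode"]].append(s)

    # Per-mode statistics: sample count, average reward, average length.
    counts = {m: len(g) for m, g in by_mode.items()}
    mode_reward = {m: mean(x["reward"] for x in g) for m, g in by_mode.items()}
    mode_length = {m: mean(x["length"] for x in g) for m, g in by_mode.items()}

    # Assumed mode-level advantage: standardize the per-mode mean rewards
    # across the modes that actually appear in the group.
    mu = mean(mode_reward.values())
    sigma = pstdev(mode_reward.values())
    mode_adv = {m: (r - mu) / (sigma + eps) for m, r in mode_reward.items()}

    # Assumed sample-level advantage: standardize each reward over the
    # whole group, regardless of mode.
    rewards = [s["reward"] for s in samples]
    r_mu, r_sigma = mean(rewards), pstdev(rewards)
    sample_adv = [(r - r_mu) / (r_sigma + eps) for r in rewards]

    return counts, mode_reward, mode_length, mode_adv, sample_adv


# Toy group: four rollouts spread over two modes (made-up numbers).
group = [
    {"mode": 1, "reward": 1.0, "length": 10},
    {"mode": 1, "reward": 3.0, "length": 30},
    {"mode": 2, "reward": 5.0, "length": 50},
    {"mode": 2, "reward": 7.0, "length": 70},
]
counts, mode_reward, mode_length, mode_adv, sample_adv = group_statistics(group)
print(counts, mode_reward, mode_length)
```

Under this sketch, each per-mode mean is only as reliable as `counts[m]`. If rollouts were spread evenly, a group of G rollouts over N modes would give about G/N samples per mode, so increasing N at a fixed G leaves some modes with one sample or none. That is the sample-inefficiency risk the reviewer describes.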
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Reinforcement Learning in Robotics
