TL;DR
This paper replicates and extends a study of cooperative behavior in LLMs. It validates the earlier findings and tests new models, settings, and languages to assess how well the results generalize and how models influence one another in multi-agent cooperation.
Contribution
It reproduces the prior results and introduces new experiments with additional models, environments, and languages to evaluate the cooperative capabilities of LLMs.
Findings
Large models like GPT-4-turbo achieve sustainable cooperation.
In heterogeneous multi-agent systems, high-performing models influence the behavior of the other agents.
The benchmark applies successfully across different models, scenarios, and languages.
Abstract
This study evaluates and extends the findings made by Piatti et al., who introduced GovSim, a simulation framework designed to assess the cooperative decision-making capabilities of large language models (LLMs) in resource-sharing scenarios. By replicating key experiments, we validate claims regarding the performance of large models, such as GPT-4-turbo, compared to smaller models. The impact of the universalization principle is also examined, with results showing that large models can achieve sustainable cooperation, with or without the principle, while smaller models fail without it. In addition, we provide multiple extensions to explore the applicability of the framework to new settings. We evaluate additional models, such as DeepSeek-V3 and GPT-4o-mini, to test whether cooperative behavior generalizes across different architectures and model sizes. Furthermore, we introduce new…
Taxonomy
Methods: ADaptive gradient method with the OPTimal convergence rate
