Language and Experience: A Computational Model of Social Learning in Complex Tasks

Cédric Colas; Tracey Mills; Ben Prystawski; Michael Henry Tessler; Noah Goodman; Jacob Andreas; Joshua Tenenbaum

arXiv:2509.00074·cs.AI·February 19, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a computational framework that models social learning by integrating linguistic guidance and sensorimotor experience, facilitating improved exploration and knowledge transfer in complex tasks for humans and AI.

Contribution

It presents a novel probabilistic model that turns pretrained language models into tools for social learning, enabling advice generation, interpretation, and cross-generational knowledge transfer.

Findings

1. Linguistic guidance accelerates learning and exploration.

2. Humans and models benefit from structured language-compatible representations.

3. Successful knowledge transfer demonstrated between humans and AI models.

Abstract

The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments. How do people integrate these two sources of knowledge, and how might AI systems? We present a computational framework that models social learning as joint probabilistic inference over structured, executable world models given sensorimotor and linguistic data. We make this possible by turning a pretrained language model into a probabilistic model of how humans share advice conditioned on their beliefs, allowing our agents both to generate advice for others and to interpret linguistic input as evidence during Bayesian inference. Using behavioral experiments and simulations across 10 video games, we show how linguistic guidance can shape exploration and accelerate learning by reducing risky interactions and speeding up key…

Equations

$$a_{t}=\arg\max_{a}\ \mathbb{E}_{s_{i+1}\sim P(s_{i+1}\,|\,s_{i},a)}\left[\sum_{i=t}^{\infty}\gamma^{i-t}\,r_{i}\right],$$

$$P(T\,|\,E,L)\propto P(E\,|\,T)\times P(L\,|\,T)\times P(T),$$

$$P(r_{\text{new}}\,|\,r_{\text{old}},E,L,T)\propto\big(P_{0}(r_{\text{new}}\,|\,E,T)+P_{\text{LM}}(r_{\text{new}}\,|\,\text{prompt}(L,T))\big)/2$$

$$\text{Value}(g)=\text{ExplorationValue}(g)+\text{ExploitationValue}(g).$$

$$\text{ExplorationValue}(g)=1-\frac{\max_{i}\text{count}(i\,|\,g)}{\sum_{i}\text{count}(i\,|\,g)}$$

$$\text{ExploitationValue}(g)=\begin{cases}10&\text{if }g\text{ contributes to win condition}\\8&\text{if }g\text{ protects essential resources}\\6&\text{if }g\text{ creates useful tools}\\2&\text{if }g\text{ collects resources or rewards}\\0&\text{else}\end{cases}$$

$$V(a)=R_{\text{game}}(a)+R_{\text{goal}}(a)+R_{\text{win/loss}}(a),$$

$$V(a_{\text{original}})>\max_{i\in 1\ldots 10}V(a_{\text{lookahead}_{i}}).$$

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Novelty in Theoretical Framework: The paper introduces an innovative computational model that treats language not merely as an input feature, but as probabilistic evidence for Bayesian inference over potential world models. This is a theoretically rigorous approach to language grounding.
2. High Sample Efficiency: By leveraging linguistic guidance as a strong prior, the method significantly reduces the hypothesis space, leading to substantially faster learning and higher sample efficiency co

Weaknesses

1. Scalability Challenge in Model Space: While the framework operates over "structured world models," the computational feasibility of performing inference (or even representation) over the vast space of possible models ($\text{Model}$) in real-world, high-dimensional, or continuous state environments remains a major concern. The paper needs to better address how efficient search or approximation is maintained.
2. Experimental Complexity Limitation: Although the title claims to address "Complex

Reviewer 02 · Rating 8 · Confidence 4

Strengths

(1) The paper is well-written, with clear motivation and an accessible presentation. The formulation of joint inference over experiential and linguistic guidance through Bayesian updating is both insightful and valuable.
(2) The human evaluation experiments on building better learning partners are interesting and effectively demonstrate the role of linguistic guidance in improving video game performance. Moreover, the study’s extension to bidirectional communication between humans and models pr

Weaknesses

(1) Limited generalization: The current modeling approach is evaluated across different video games. However, in real-world agentic environments, the belief space can be far more complex, posing additional challenges for both modeling and linguistic guidance. As a result, the proposed method may have limited applicability beyond specific domains.
(2) Model selection discussion: It would be valuable to include an analysis of different LLM choices. Stronger models may offer more effective guidanc

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* The writing is clear and engaging.
* The combination of behavioral experiments with humans and artificial agents is very creative.
* The results on cross-generational knowledge transfer are intriguing and connect well to theories of cumulative culture and iterated learning.
* The work highlights a bidirectional exchange between humans and models, in which both learn from each other, which is a timely topic to consider.

Weaknesses

* _Framing_ The title and framing promise a computational model of social learning, but the core of the paper is some empirical observations of social learning among and between agents and humans. The authors should consider revising the title or clarifying that the contribution is primarily empirical and conceptual, not a new formal model.
* It is unclear what are the structured world models under the claimed “joint probabilistic inference over structured world models.”
* The claim that the LL

Taxonomy

Topics: Language and cultural evolution · Embodied and Extended Cognition · Child and Animal Learning Development

Full text

Language and Experience: a computational model of social learning in complex tasks

Cédric Colas (MIT), Tracey Mills (MIT), Ben Prystawski (Stanford University), Michael Henry Tessler (Google DeepMind), Noah Goodman (Stanford University), Jacob Andreas (MIT), Joshua Tenenbaum (MIT)

Abstract

The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments. How do people integrate these two sources of knowledge, and how might AI systems? We present a computational framework that models human social learning as joint probabilistic inference over structured, executable world models given sensorimotor and linguistic data. We make this possible by turning a pretrained language model into a probabilistic model of how humans share advice conditioned on their beliefs, allowing our agents both to generate advice for others and to interpret linguistic input as evidence during Bayesian inference. Using behavioral experiments and simulations across 10 video games, we show how linguistic guidance can shape exploration and accelerate learning by reducing risky interactions and speeding up key discoveries in both humans and models. We further explore how knowledge can accumulate across generations through iterated learning experiments and demonstrate successful knowledge transfer between humans and models — revealing how structured, language-compatible representations might facilitate human-machine collaborative learning.

Code: github.com/ccolas/language_and_experience

Demo: cedriccolas.com/demos/language_and_experience

1 Introduction

Imagine learning to forage mushrooms in autumn woods. Each outing provides direct experience of promising slopes and soil conditions. But an experienced forager’s advice can transform your exploration in two crucial ways: “Never touch the red ones with white spots; they’re deadly” helps you avoid fatal mistakes, while “Look for chanterelles near oak trees after warm rains” turns random wandering into focused search. This ability to integrate linguistic guidance with direct experience is fundamental to human intelligence, enabling not only safer and more efficient learning, but also the accumulation of knowledge across generations (Tomasello, 2009; Boyd et al., 2011). Yet, we still lack a general computational account of how humans combine these two modes of knowledge acquisition to inform exploration and decision-making in complex tasks.

Current computational models capture only fragments of this dual learning capability. Reinforcement learning (RL) agents, for example, can master complex tasks but require extensive trial-and-error (millions of interaction steps) before achieving proficiency (Sutton and Barto, 2018; Mnih et al., 2015). Theory-based RL mitigates this limitation by combining planning with Bayesian inference over structured world models, achieving human-like sample efficiency, yet remains incapable of social learning (Tsividis et al., 2021; Griffiths et al., 2010; Lake et al., 2017). Attempts to bridge this gap with language-conditioned RL integrate linguistic input into the learning process but rely on massive amounts of paired experience and language, making real-world application impractical (Zhong et al., 2020; Luketina et al., 2019). While large language models (LLMs) excel at processing linguistic guidance flexibly (Brown et al., 2020), they struggle with interactive planning and embodied learning (Valmeekam et al., 2023; Paglieri et al., 2024). Bayesian models of social cognition provide a promising step towards integrating observed behaviors and language to infer goals (Shafto et al., 2014; Jara-Ettinger et al., 2016), yet they typically operate in simple, non-interactive tasks with predefined hypothesis spaces (Ying et al., 2023; Zhi-Xuan et al., 2024). (See more detailed related work in Appendix Section A.)

Our primary aim is to contribute to cognitive science by offering a computational account of how humans integrate language and experience. We propose a Bayesian framework that treats linguistic guidance and direct experience as complementary evidence sources learners can leverage to infer executable, program-like world models. To make this possible, we introduce three key contributions:

• LMs as speaker models: We leverage LMs to approximate the probability that a human with specific world beliefs would produce particular linguistic advice, enabling our model to interpret and generate human-interpretable guidance.

• Inference from linguistic evidence: This speaker model is used to evaluate the plausibility of received advice under different candidate world models, allowing linguistic input to shape Bayesian inference alongside experiential learning.

• LM-accelerated inference: We use LMs to transform advice into targeted proposal distributions, guiding Bayesian updates towards the most promising regions of the hypothesis space.

These mechanisms allow our computational model to learn efficiently from naturalistic linguistic input, update beliefs in real-time, and share discoveries with others — forming the basis for more structured and communicative learning systems.

We validate our framework through human experiments and computational simulations, showing that linguistic guidance accelerates learning, shapes exploration strategies, and supports knowledge transfer across generations as well as between humans and models. These results reveal how language drives cultural transmission and illustrate how structured, language-compatible representations can facilitate human–AI collaborative learning.

2 Video games as learning environments

Video games provide an ideal experimental paradigm for studying social and individual learning (Allen et al., 2024): they offer rich causal environments for systematic exploration, create natural pressure to learn efficiently through costs like lost lives, and enable the study of how complex mechanics can be communicated through language. Drawing inspiration from earlier work modeling human causal learning (Tsividis et al., 2021; Tomov et al., 2023) and cultural knowledge transmission (Tessler et al., 2021), we reuse a suite of 10 games defined in the Video Game Description Language (VGDL) (Schaul, 2013).

VGDL lets us define games by specifying object types, collision effects, rewards, and win/loss conditions (e.g. yellow objects are deadly, victory requires eliminating green objects; see Figure 2 and DSL description in Appendix Section C). This formal representation enables precise experimental control and provides a structured language for our model to represent and update hypotheses about game dynamics. Players, however, encounter the games as grids of colored squares, having to discover game dynamics and objectives through exploration. Our games were designed to test diverse learning capabilities: some require spatial reasoning to navigate hazards (pushBoulders) or use teleportation mechanics (portals); others demand quick tactical decisions like shooting threats (aliens) or defending resources (plaqueAttack). Most challenging are games requiring systematic experimentation to discover novel objects through object combinations (relational). The games can be played on the demo website.

3 Computational model of integrated learning

We frame game learning as a problem of sequential decision-making under uncertainty (Kaelbling et al., 1998). At each time step $t$ , agents select actions $a_{t}$ to maximize expected cumulative rewards:

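$$a_{t}=\arg\max_{a}\ \mathbb{E}_{s_{i+1}\sim P(s_{i+1}\,|\,s_{i},a)}\left[\sum_{i=t}^{\infty}\gamma^{i-t}\,r_{i}\right],$$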

where $s_{i}$ and $r_{i}$ are the agent’s state and reward at time $i$ , and $\gamma$ is a discount factor. The main challenge lies in the agent’s uncertainty about the transition function $P(s_{i+1}\,|\,s_{i},a)$ , which is governed by unknown game dynamics.
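To make the objective concrete, the sketch below estimates the expected discounted return of each candidate action by Monte Carlo rollouts and picks the argmax. It is only an illustration: the `toy_step` transition function, the uniformly random policy used after the first action, the finite horizon that truncates the infinite sum, and all function names are assumptions introduced here, not part of the paper's implementation.

```python
import random


def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a finite reward sequence (k = i - t)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def toy_step(state, action, rng):
    """Hypothetical stochastic transition: returns (next_state, reward)."""
    if action == "safe":
        return state + 1, 1.0
    # "risky": rare large reward, otherwise a small penalty
    return state + 1, (10.0 if rng.random() < 0.1 else -1.0)


def estimate_value(state, first_action, actions, gamma, horizon, n_rollouts, rng):
    """Monte Carlo estimate of the expected discounted return obtained by
    taking `first_action` now and acting uniformly at random afterwards."""
    total = 0.0
    for _ in range(n_rollouts):
        s, a, rewards = state, first_action, []
        for _ in range(horizon):
            s, r = toy_step(s, a, rng)
            rewards.append(r)
            a = rng.choice(actions)
        total += discounted_return(rewards, gamma)
    return total / n_rollouts


def select_action(state, actions, gamma=0.9, horizon=5, n_rollouts=2000, seed=0):
    """Return the action with the highest estimated value, plus all estimates."""
    rng = random.Random(seed)
    values = {
        a: estimate_value(state, a, actions, gamma, horizon, n_rollouts, rng)
        for a in actions
    }
    return max(values, key=values.get), values


best, values = select_action(state=0, actions=["safe", "risky"])
print(best, values)
```

In the paper's setting the transition function is unknown, so the rollouts would have to run through an inferred world model rather than a known `toy_step`.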

Our approach extends the theory-based RL framework by proposing to infer causal world models jointly from experience and linguistic guidance to support goal-directed planning and strategic exploration (Tsividis et al., 2021). The model alternates between three phases: (1) it constructs and infers a posterior over probabilistic world models given both experience and linguistic guidance, (2) it plans action sequences by identifying and pursuing high-value interactions that balance exploration and exploitation, and (3) it executes these actions in the game. The following sections detail each component of this learning loop.

3.1 Inference of structured, causal world models

Our agent models its environment with a distribution over structured world models, each represented as a probabilistic program specifying game rules and objectives. These beliefs are continually updated as the agent gathers new evidence from gameplay experience $E$ and linguistic guidance $L$. We formalize this belief update as Bayesian inference over possible world theories $T$:

$$P(T\,|\,E,L)\propto P(E\,|\,T)\times P(L\,|\,T)\times P(T),$$

where $P(T)$ encodes prior beliefs over plausible world models, while $P(E\,|\,T)$ and $P(L\,|\,T)$ measure the consistency of theory $T$ with experiential and linguistic evidence, respectively. As illustrated in Figure 1b, the agent continuously refines its beliefs, incrementally integrating new data from experience and reinterpreting language to narrow down its hypothesis space. In the following subsections, we detail: 1) the search space of possible world models, 2) the likelihood functions that quantify the fit of experience and linguistic guidance, and 3) the inference algorithm that efficiently approximates the posterior distribution over possible worlds.
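For a finite set of candidate theories, the proportionality above can be normalized directly. The following sketch does so in log space for numerical stability. The three candidate theories and their scores are made-up numbers for illustration, and `posterior_over_theories` is a hypothetical helper name.

```python
import math


def posterior_over_theories(log_prior, log_lik_experience, log_lik_language):
    """Normalize P(T | E, L), which is proportional to P(E | T) P(L | T) P(T),
    over a finite list of candidate theories, working in log space."""
    log_joint = [
        p + e + l
        for p, e, l in zip(log_prior, log_lik_experience, log_lik_language)
    ]
    m = max(log_joint)  # subtract the max before exponentiating (log-sum-exp trick)
    unnorm = [math.exp(x - m) for x in log_joint]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Three hypothetical candidate theories with illustrative scores.
log_prior = [math.log(0.5), math.log(0.3), math.log(0.2)]
log_lik_experience = [-12.0, -10.0, -10.5]
log_lik_language = [-40.0, -35.0, -42.0]

post = posterior_over_theories(log_prior, log_lik_experience, log_lik_language)
print([round(p, 4) for p in post])
```

Because the result is normalized, only differences between the log scores of the candidates affect the posterior.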

A space of possible worlds. World models are programs specifying the transition function and reward structure, while inference consists of updating a posterior distribution over these executable programs. Each candidate world model, or theory $T$, is represented as a VGDL program that specifies object types for each object color (e.g. missile, shooting avatar), interaction effects between objects (e.g. collision with yellow kills avatar), reward functions (e.g. +1 when avatar kills green), and win/loss conditions (e.g. kill all green); see the example in Figure 2 and the complete list of VGDL primitives in Appendix Section C. We define a simplicity-biased prior $P(T)$ over the search space, favoring theories with fewer rules: 1) object types are uniformly distributed, 2) any object pair has a $p=0.25$ chance of interacting, with interaction types sampled uniformly, and 3) each object’s death has a $p=0.1$ chance of contributing to win or loss conditions. Sampling a theory from this prior involves generating object types, object pair interactions, and win/loss conditions accordingly. These theories are executable: they can be compiled into playable games that let the agent simulate trajectories internally. Play them yourself on our demo website.
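The sampling procedure described above can be sketched as follows. The type and interaction vocabularies are small hypothetical stand-ins (the actual VGDL primitives are in the paper's Appendix Section C). Treating object pairs as ordered and choosing win versus loss with equal probability are assumptions made here, since the text does not specify them.

```python
import itertools
import random

# Hypothetical, heavily reduced vocabularies used only for this sketch.
OBJECT_TYPES = ["immovable", "missile", "chaser", "random_mover"]
INTERACTION_TYPES = ["kill", "step_back", "transform", "push"]


def sample_theory(colors, rng, p_interact=0.25, p_terminal=0.1):
    """Draw one candidate theory from a simplicity-biased prior."""
    # 1) Object types are sampled uniformly for each color.
    object_types = {c: rng.choice(OBJECT_TYPES) for c in colors}

    # 2) Each pair interacts with probability p_interact
    #    (assumption: pairs are ordered), with a uniformly sampled effect.
    interactions = {}
    for pair in itertools.permutations(colors, 2):
        if rng.random() < p_interact:
            interactions[pair] = rng.choice(INTERACTION_TYPES)

    # 3) Each object's death enters a termination condition with probability
    #    p_terminal (assumption: win and loss are equally likely).
    win, loss = [], []
    for c in colors:
        if rng.random() < p_terminal:
            (win if rng.random() < 0.5 else loss).append(c)

    return {
        "object_types": object_types,
        "interactions": interactions,
        "win_if_all_dead": win,
        "loss_if_all_dead": loss,
    }


colors = ["white", "yellow", "green", "blue"]
theory = sample_theory(colors, random.Random(0))
print(theory)
```

Since most pairs do not interact under this prior, sampled theories tend to contain few rules, which is the simplicity bias the text describes.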

Likelihood from experience. To estimate the likelihood $P(E\,|\,T)$ , we first decompose the agent’s experience $E$ — a sequence of symbolic state transitions — into a sequence of discrete events $e_{i}$ (object movements, appearance or disappearance, rewards, and win/loss events). We assume that these local events are conditionally independent given the theory $T$ , which lets us factorize the likelihood as $P(E\,|\,T)=\prod_{i=1}^{n}P(e_{i}\,|\,T)$ . Because candidate world models are executable, we can estimate $P(e_{i}\,|\,T)$ through simulations. Specifically, we replay the agent’s actions under $T$ by initializing the game engine to the agent’s previous state and executing its chosen action. We then track the occurrence of each observed event $e_{i}$ across $20$ independent simulations, using their frequencies as empirical estimates of $P(e_{i}\,|\,T)$ .
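A minimal sketch of this simulation-based estimate is given below. The `toy_simulate` function stands in for the game engine and is entirely hypothetical, as is the theory format. The small `floor` that keeps never-simulated events from producing `log(0)` is an assumption added here, not something the text states.

```python
import math
import random


def toy_simulate(theory, state, action, rng):
    """Hypothetical stand-in for executing `action` from `state` under `theory`.
    Returns the set of events produced by one simulated step."""
    events = {"avatar_moves"}
    if action == "touch_yellow" and rng.random() < theory["p_yellow_kills"]:
        events.add("avatar_dies")
    return events


def event_likelihood(theory, state, action, event, rng, n_sims=20, floor=1e-3):
    """Empirical frequency of `event` across n_sims independent simulations."""
    hits = sum(event in toy_simulate(theory, state, action, rng) for _ in range(n_sims))
    return max(hits / n_sims, floor)  # floor is an assumption, see lead-in


def log_likelihood(theory, experience, rng, n_sims=20):
    """log P(E | T) under conditional independence of events given T."""
    return sum(
        math.log(event_likelihood(theory, state, action, event, rng, n_sims))
        for state, action, events in experience
        for event in events
    )


# Experience: (previous state, chosen action, observed events).
experience = [
    ("s0", "move_left", ["avatar_moves"]),
    ("s1", "touch_yellow", ["avatar_moves", "avatar_dies"]),
]
lethal = {"p_yellow_kills": 1.0}
harmless = {"p_yellow_kills": 0.0}

rng = random.Random(0)
ll_lethal = log_likelihood(lethal, experience, rng)
ll_harmless = log_likelihood(harmless, experience, rng)
print(ll_lethal, ll_harmless)
```

A theory in which yellow is lethal explains the observed death, so it scores higher than one in which yellow is harmless.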

Likelihood from language. Linguistic advice received from other agents serves as evidence for evaluating candidate theories $T$ , modeled through Bayesian Theory-of-Mind (Baker et al., 2011). We formalize $P(L\,|\,T)$ as the probability that a speaker, believing $T$ to be true, would produce the observed message $L$ . To approximate this, we use a language model (LLaMA-3.1-70B) as a probabilistic speaker model. Given a description of $T$ , the LM is prompted to generate advice for a future player, see prompt in Appendix J. The likelihood is then estimated as the LM’s probability of producing the exact message $L$ : $P(L\,|\,T)\approx P_{\text{LM}}(L\,|\,\text{prompt}(T))$ . This approximation measures how well $T$ explains the speaker’s linguistic behavior: e.g. if $L$ contains the advice “avoid yellow at all cost!”, then $P(L\,|\,T)$ will be higher if $T$ contains the rule “yellow kills avatar,” than if it does not.

Although $P_{\text{LM}}(L\,|\,\text{prompt}(T))$ is not an accurate model of human speakers, our inference procedure relies only on relative likelihoods across theories. What matters for the posterior is the pattern of likelihood differences between candidate theories, not the absolute scale. This use of approximate generative models is common in Bayesian cognitive modeling, where the goal is to capture how linguistic evidence shifts beliefs across hypotheses rather than to recover exact speaker probabilities.
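One way to see why only relative likelihoods matter is the idealized case where the LM score differs from the true speaker probability by a factor $c(L)$ that does not depend on the theory. This is an illustrative assumption rather than a claim made in the text; distortions that vary with $T$ would not cancel. Writing $\tilde{P}(L\,|\,T)=c(L)\,P(L\,|\,T)$, the factor drops out of the normalized posterior:

$$\frac{P(E\,|\,T)\,\tilde{P}(L\,|\,T)\,P(T)}{\sum_{T^{\prime}}P(E\,|\,T^{\prime})\,\tilde{P}(L\,|\,T^{\prime})\,P(T^{\prime})}=\frac{c(L)\,P(E\,|\,T)\,P(L\,|\,T)\,P(T)}{c(L)\sum_{T^{\prime}}P(E\,|\,T^{\prime})\,P(L\,|\,T^{\prime})\,P(T^{\prime})}=P(T\,|\,E,L).$$

The same cancellation applies to any ratio of posterior probabilities between two theories.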

Inference algorithm. The space of possible theories is vast (exceeding $10^{20}$ configurations for games with just five objects), making exact Bayesian inference intractable. To approximate the posterior distribution $P(T\,|\,E,\,L)$, we use a particle filter with Metropolis rejuvenations (Metropolis et al., 1953; Chopin, 2002). We maintain a population of $M=20$ candidate theories and iteratively refine them by: (1) resampling theories proportional to their posterior probability, and (2) proposing local modifications guided by observed events and linguistic guidance (see details in Appendix Section E). Each proposal modifies exactly one rule at a time: an object’s type, the interaction between a pair of objects, a win condition, a loss condition, or a reward function. A proposal is accepted with probability given by the ratio of posterior probabilities between modified and original theories, capped at one: $p_{\text{accept}}=\min(1,P(T^{\prime}\,|\,E,\,L)/P(T\,|\,E,\,L))$ (Metropolis et al., 1953). This process efficiently approximates the posterior distribution $P(T\,|\,E,L)$ over candidate theories, with each particle $T_{i}$ assigned a weight $w_{i}$ proportional to its posterior probability given experience and linguistic guidance. The resulting distribution represents the agent’s belief over possible game dynamics and objectives. The agent executes one inference step (1 resampling step + 5 rejuvenation steps) every 20 environment steps, and 20 inference steps every time a new kind of object appears in the scene, or when the agent dies; see full pseudo-code of the inference algorithm in Appendix Section D.
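The resample-then-rejuvenate loop can be illustrated on a toy problem. In the sketch below a theory is a tuple of five binary rules, the unnormalized log posterior penalizes each rule that differs from a hypothetical `TARGET`, and proposals flip one rule at a time. These are all assumptions chosen for brevity; the paper's pseudo-code is in its Appendix Section D and is not reproduced here. The acceptance rule assumes a symmetric proposal.

```python
import math
import random

TARGET = (1, 0, 1, 1, 0)  # hypothetical "true" rule set with 5 binary rules


def log_posterior(theory):
    """Toy unnormalized log posterior: each wrong rule costs 3 nats."""
    return -3.0 * sum(a != b for a, b in zip(theory, TARGET))


def propose(theory, rng):
    """Modify exactly one rule (a symmetric single-flip proposal)."""
    i = rng.randrange(len(theory))
    return theory[:i] + (1 - theory[i],) + theory[i + 1:]


def resample(particles, rng):
    """Resample particles in proportion to their posterior probability."""
    log_w = [log_posterior(t) for t in particles]
    m = max(log_w)
    weights = [math.exp(w - m) for w in log_w]
    return rng.choices(particles, weights=weights, k=len(particles))


def metropolis_step(theory, rng):
    """Accept the proposal with probability min(1, P(T') / P(T))."""
    candidate = propose(theory, rng)
    log_ratio = log_posterior(candidate) - log_posterior(theory)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return candidate
    return theory


def inference_step(particles, rng, n_rejuvenation=5):
    """One resampling step followed by several rejuvenation steps."""
    particles = resample(particles, rng)
    for _ in range(n_rejuvenation):
        particles = [metropolis_step(t, rng) for t in particles]
    return particles


rng = random.Random(0)
particles = [tuple(rng.randint(0, 1) for _ in TARGET) for _ in range(20)]
for _ in range(30):
    particles = inference_step(particles, rng)
best = max(particles, key=log_posterior)
print(best, sum(t == TARGET for t in particles), "of", len(particles))
```

A full sequential Monte Carlo implementation would track incremental weights as new evidence arrives; this sketch recomputes the full posterior at each step to stay short.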

Language-guided proposals. We use the same LM to bias the proposal of game rules — e.g. after receiving the message “yellow kills you,” the LM may propose rules capturing this lethal interaction:

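$$P(r_{\text{new}}\,|\,r_{\text{old}},E,L,T)\propto\big(P_{0}(r_{\text{new}}\,|\,E,T)+P_{\text{LM}}(r_{\text{new}}\,|\,\text{prompt}(L,T))\big)/2,$$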

where $r_{\text{new}}$ is a candidate rule (e.g. “yellow kills avatar”), $r_{\text{old}}$ is the current rule, $P_{0}$ is the base proposal distribution, and $P_{\text{LM}}$ is the probability the language model assigns to that rule given the message. This is implemented by prompting the LM with the received message $L$ and instructing it to answer multiple-choice questions about specific VGDL rules: e.g. does the yellow object 1) kill the avatar, 2) step back against the avatar, etc.; see detailed prompt in Appendix Section J. This process biases inference towards theories containing rules most compatible with received advice, resulting in faster convergence.
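The equal-weight mixture of a base proposal and an LM-derived proposal can be sketched as below. The rule names and both probability tables are invented for illustration; in particular, `p_lm` stands in for the LM's multiple-choice answer probabilities, which this sketch does not compute.

```python
import random


def mixture_proposal(p_base, p_lm):
    """Average the base and LM proposal distributions over candidate rules."""
    rules = sorted(set(p_base) | set(p_lm))
    return {r: 0.5 * (p_base.get(r, 0.0) + p_lm.get(r, 0.0)) for r in rules}


def sample_rule(dist, rng):
    """Draw one candidate rule from a proposal distribution."""
    rules = list(dist)
    return rng.choices(rules, weights=[dist[r] for r in rules], k=1)[0]


# Candidate effects of yellow on the avatar (hypothetical options).
p_base = {"kill": 0.25, "step_back": 0.25, "push": 0.25, "no_interaction": 0.25}
# Hypothetical LM answer probabilities after the message "yellow kills you".
p_lm = {"kill": 0.90, "step_back": 0.05, "push": 0.03, "no_interaction": 0.02}

mixed = mixture_proposal(p_base, p_lm)
rng = random.Random(0)
samples = [sample_rule(mixed, rng) for _ in range(1000)]
print(mixed, samples.count("kill"))
```

Keeping half of the mass on the base proposal means rules the message never mentions can still be proposed, so misleading advice does not remove hypotheses from the search.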

3.2 Goal-directed planning and strategic exploration

To maximize its expected long-term utility, the agent must balance exploration — gathering information to refine its world model — and exploitation — leveraging its current understanding to achieve game objectives. Planning is guided by the maximum a posteriori (MAP) theory $T_{\text{MAP}}$ , inferred during Bayesian updates: the agent’s best estimate of game rules and objectives. Based on $T_{\text{MAP}}$ simulations, the agent selects high-level goals, and plans action sequences to achieve them.

Goal sampling. Based on $T_{\text{MAP}}$ , the agent defines a space of high-level goals as object-object interactions that it can cause in the environment: collisions between the agent, something it can push or shoot, and any other object. Each goal is assigned two values: 1) an exploitation value reflecting its contribution to game objectives (e.g. higher if it is thought to trigger a reward or a win), and 2) an exploration value representing its potential to reduce model uncertainty, measured as the disagreement about what would happen across the $M=20$ candidate world models. Subgoals are sampled in proportion to their combined value, balancing both learning and game progress.
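The goal-value computation can be sketched as follows, using the definitions listed in the Equations section: the exploration value is one minus the share of candidate models that agree on the most common predicted outcome, and the total value adds an exploitation score. The scores 10, 8, 6, 2, 0 are read from the piecewise definition in that section, whose extraction is garbled, so this reading is an assumption. The category labels, goal names, and uniform fallback are hypothetical.

```python
import random
from collections import Counter

# Assumed reading of the piecewise exploitation scores (see lead-in).
EXPLOITATION_VALUES = {
    "win_condition": 10,
    "protects_resources": 8,
    "creates_tools": 6,
    "collects_rewards": 2,
    "other": 0,
}


def exploration_value(predicted_outcomes):
    """1 - max_i count(i | g) / sum_i count(i | g), where each entry is the
    outcome one candidate world model predicts for goal g."""
    counts = Counter(predicted_outcomes)
    return 1.0 - max(counts.values()) / sum(counts.values())


def goal_value(predicted_outcomes, category):
    """Value(g) = ExplorationValue(g) + ExploitationValue(g)."""
    return exploration_value(predicted_outcomes) + EXPLOITATION_VALUES[category]


def sample_goal(goal_values, rng):
    """Sample a goal in proportion to its combined value."""
    goals = list(goal_values)
    weights = [goal_values[g] for g in goals]
    if sum(weights) == 0:  # assumption: fall back to uniform sampling
        return rng.choice(goals)
    return rng.choices(goals, weights=weights, k=1)[0]


# Predictions from M = 20 candidate models for three hypothetical goals.
goals = {
    "avatar-touches-green": goal_value(["kill_green"] * 20, "win_condition"),
    "avatar-touches-purple": goal_value(["nothing"] * 10 + ["kill_avatar"] * 10, "other"),
    "avatar-touches-wall": goal_value(["step_back"] * 20, "other"),
}
print(goals, sample_goal(goals, random.Random(0)))
```

When all 20 models agree, the exploration value is zero; an even split between two outcomes gives 0.5.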

Action planning. To achieve these goals, the agent optimizes 10-step action sequences using $T_{\text{MAP}}$ for simulation. Initial action plans are refined through a simple genetic algorithm to maximize both game rewards and progress toward goals. To prevent catastrophic errors, the agent performs ten 3-step lookaheads to detect possible deaths or major deviations from the expected reward, triggering replanning when necessary. More details about planning can be found in Appendix Section F.

3.3 Generating linguistic guidance for others

The agent generates linguistic advice for future players by sampling from the same speaker model used to evaluate language likelihood during inference, effectively translating its MAP theory $T_{\text{MAP}}$ into natural language: $L_{\text{generated}}\sim P_{\text{LM}}(L\,|\,\text{prompt}(T_{\text{MAP}}))$. This provides optimal speaker modeling when message emitters are computational agents and approximate modeling when they are human. By using the LM both to interpret linguistic guidance and to generate it, the agent captures key aspects of Bayesian Theory of Mind: modeling how humans communicate their beliefs and how they interpret the beliefs of others through language.

3.4 Baseline models

We compare our approach to three baselines: 1) Oracle: a model that plans to solve the game using ground-truth game rules, 2) Deep RL: a Double Deep Q-Network agent implementing pure trial-and-error learning without structured representations (Mnih et al., 2015; Van Hasselt et al., 2016), and 3) pure LM: an LM agent (LLaMA-3.1-70B) that leverages state-of-the-art techniques to scaffold long-term decision making: the ReAct approach (Yao et al., 2022), a scratch pad (Nye et al., 2021), and chain-of-thought reasoning towards belief updating and plan formation (Wei et al., 2022); see detailed prompt in Appendix Section J. These baselines test the importance of structured representations and goal-directed planning. Note that we do not compare to language-conditioned RL baselines (Luketina et al., 2019; Colas et al., 2022). These methods rely on thousands of episodes of paired (state, language) supervision to learn how to map linguistic inputs to action policies, and they do not generalize to new, idiosyncratic messages seen only once. In our design, each player receives a single novel message per game, mirroring human one-shot social learning, so language-conditioned RL would receive no opportunity to learn from language and would behave identically to pure deep RL, which we already include as a baseline.

4 Experimental paradigm

To investigate how humans and computational models integrate experiential and linguistic evidence during learning, we conducted a series of IRB-approved experiments comparing three learning conditions (see Figure 1a):

1. Experience: Players learn solely through direct interaction with the game.

2. Experience + human message: Players receive additional advice from previous human players.

3. Experience + model message: Players receive additional advice from previous model players.

This design allows us to examine both the effectiveness of linguistic guidance and potential asymmetries in human-model knowledge transfer. In each condition, players had 15 lives to solve four levels of each game, advancing only after completing the current level.

Participants. We recruited 122 participants through Prolific to play 5 randomly-assigned games. To ensure task engagement while maintaining a representative sample, we excluded participants who failed to complete at least one level in $\geq 3$ games (final N=120). Participants were randomly assigned to one of the 3 conditions (N=40 each). In social conditions, each participant received advice from a randomly-selected previous player (either human or model) who had completed the game in the experience-only condition (1-to-1 mapping).

Procedure. All participants first completed a brief tutorial game to familiarize themselves with the interface and basic game mechanics. In social conditions, participants read advice from previous players before starting each new game, and during gameplay. After completing each game (either by winning or depleting lives), participants wrote advice “to help future players who have not yet played the game.” This prompt encouraged participants to distill their learned knowledge into linguistic guidance, see full instructions in Appendix Section G.

Analysis approach. To evaluate learning efficiency, we tracked both the number of lives required to complete each level and the total proportion of levels completed. We use a normalized area-under-curve (nAUC) metric ( $\in[0,1]$ ) to integrate both proficiency and learning efficiency. To analyze message effectiveness, we manually coded advice content along four dimensions: the fraction of useful information about game dynamics, about loss conditions, about win conditions, and the presence of incorrect information. This coding scheme lets us examine how specific types of linguistic guidance shape exploration and learning outcomes. We will report differences between conditions as $\Delta$ (nAUC).
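The text does not spell out how nAUC is computed, so the sketch below shows only one plausible construction: the learning curve is the fraction of levels completed after each life, and nAUC is its mean over the life budget. This definition, the function names, and the default of four levels and 15 lives (taken from the experimental design) are assumptions for illustration.

```python
def learning_curve(completion_lives, n_levels, life_budget):
    """Fraction of levels completed after each life 1..life_budget.

    completion_lives: for each solved level, the number of lives used so far
    when that level was completed (unsolved levels are simply omitted).
    """
    return [
        sum(1 for c in completion_lives if c <= k) / n_levels
        for k in range(1, life_budget + 1)
    ]


def nauc(completion_lives, n_levels=4, life_budget=15):
    """Normalized area under the learning curve, in [0, 1] (assumed definition)."""
    curve = learning_curve(completion_lives, n_levels, life_budget)
    return sum(curve) / life_budget


fast = nauc([1, 2, 2, 3])    # all four levels solved within three lives
slow = nauc([4, 9, 14])      # three levels solved, late
print(fast, slow, fast - slow)
```

Under this construction, solving more levels and solving them earlier both raise the score, which matches the stated aim of integrating proficiency and learning efficiency.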

Computational simulations. We conducted 20 simulation runs per condition, matching the human sample size to enable direct comparison. The model received the same information as human participants: in social conditions, it processed the same messages (human- or model-generated) that humans received, while in the experience-only condition, it learned purely through interaction. We also ran iterated learning experiments where a sequence of 10 agents, each given two lives, played the game and passed a message to the next agent. This design tests whether partial knowledge can accumulate incrementally across generations, mirroring human cultural learning (Tessler et al., 2021).

5 Results

We first examine how humans and models learn novel games from pure experience, before analyzing how linguistic guidance shapes this process. We then look at human–model transfer, before exploring how our model can accumulate partial knowledge across generations. Our results reveal both striking similarities and systematic differences in learning strategies across humans and models.

5.1 Learning from experience

How efficiently can humans and models learn novel game dynamics through pure experience? Both demonstrated remarkable sample efficiency, with median participants solving 9 of 10 games and our model solving all 10 games within a 10-life budget (Figure 3). However, systematic differences emerged in games requiring specific cognitive capabilities (see per-game plots in Appendix Figure 7). In relational, which demands systematic exploration of object combinations, humans showed a bimodal pattern: 25% achieved model-like efficiency by systematically testing interactions, while 40% failed to solve even two out of four levels. This split suggests that while humans can perform systematic experimentation, not everyone defaults to it — unlike our model which explicitly reasons about information gain. Conversely, in avoidGeorge, which requires rapid planning to protect allies, models consistently outperformed humans (median levels: 4 vs 0), likely due to their capacity for accurate short-term planning.

The importance of structured reasoning becomes clear when comparing against baselines. Pure deep RL (double DQN) failed to solve any level within 10 lives, while pure LM agents never solved more than one level per game, often solving none (7/10 games). The stark difference between these baselines and both human and model performance underscores the value of structured theories of game dynamics in supporting efficient exploration and decision-making. Knowing the model of the world (oracle model) lets agents solve most games within their first life (see Figure 3 and Appendix Figure 7).

5.2 Learning from experience and human language

Having established baseline learning capabilities, we next examine how linguistic guidance from previous human players shapes exploration and learning outcomes. Our results demonstrate substantial benefits from social learning while revealing key patterns in effective knowledge transmission.

Benefits of linguistic guidance. How does linguistic guidance shape learning? Both humans and our model showed significantly faster learning when provided with human-written advice, reducing median attempts needed by 1.75 for humans (4 $\to$ 2.25) and 1.25 for models (2.5 $\to$ 1.25) (see Figure 4). To quantify these benefits in terms of learning speed, we computed a normalized area under the learning curve (nAUC). A fixed-effect model controlling for game difficulty revealed significant improvements from linguistic guidance for both humans ( $\Delta(\text{nAUC})=0.12$ , $p=2.2\times 10^{-4}$ ) and models ( $\Delta(\text{nAUC})=0.04$ , $p=3.3\times 10^{-2}$ ). These benefits were most pronounced in games with opaque mechanics like relational, or multiple hazards like portals and jaws, where guidance could directly communicate critical interactions. Benefits were smallest in games requiring primarily motor skills like aliens, missileCommand or plaqueAttack.

What makes effective guidance? Analysis of message content revealed systematic patterns in communication and message effectiveness. Most messages (88%) contained information about game dynamics (e.g. “[avoidGeorge] light blue can transform green into them”), with many also including information about win conditions (64%, e.g. “[jaws] The objective is to stay alive”) and loss conditions (74%, e.g. “[avoidGeorge] you lose if all squares get turned purple”). A small but notable fraction (11%) contained errors (e.g. “[relational] Push all the blue into the orange,” when blue should contact yellow). Longer messages and those containing detailed win conditions proved particularly beneficial to both models and humans ( $p=6.3\times 10^{-3}$ and $p=0.031$ respectively). Importantly, messages that helped humans also helped models, as shown by significant correlations in performance gains across nAUC ( $r=0.17$ ), lives to first level ( $r=0.24$ ), and lives to second level ( $r=0.16$ , all $p<0.02$ ), suggesting shared mechanisms for integrating linguistic and experiential knowledge.

How language shapes exploration. Linguistic guidance systematically altered how players explored game mechanics. Messages warning about dangers significantly reduced costly mistakes: on average, humans and models receiving such warnings experienced between 37% and 67% fewer deaths in avoidGeorge, relational, jaws, and aliens. Messages about key mechanics accelerated their discovery: in relational, players informed about tool creation discovered essential combinations 43% to 83% faster, while in plaqueAttack they learned to revive allies 43% to 62% faster when told about this possibility. However, incorrect advice could also mislead: in avoidGeorge, the human player and the model who were wrongly warned about the danger of green blocks (in fact harmless) completely avoided them for the first three episodes, while other players interacted with them 5 times on average in the same period. This demonstrates how unhelpful linguistic advice can also shape direct experience.

5.3 From cognitive models to learning partners?

Our model demonstrated the ability to efficiently learn from human-generated advice. But can it also help humans in return? Can it generate guidance that is useful to other models and, more importantly, to humans? Leveraging its LM-based advice generation capability (Section 3.3), our model produced detailed, pedagogical advice that rivaled human teaching. For example, in preconditions:

“Control the darkblue square with arrow keys. Your goal is to kill all gold objects by touching them, earning points along the way. Watch out for green objects — touching them will kill you unless you have white resources to protect yourself. Collect white resources to safeguard against green, and use them to kill green objects if needed, but be aware that each kill will cost a resource.”

Model-generated guidance significantly improved learning for both humans ( $\Delta$ (nAUC) $=0.15$ , $p<10^{-5}$ ) and models ( $\Delta$ (nAUC) $=0.088$ , $p<10^{-8}$ ) compared to experience alone. Interestingly, models learned better from model-generated than from human-generated advice ( $\Delta$ (nAUC) $=0.052$ , $p<10^{-10}$ ), whereas for humans the advantage of model-generated over human-generated advice was modest and not significant ( $\Delta$ (nAUC) $=0.035$ , $p=0.26$ , n.s.).

These asymmetries reveal important differences in human and model communication. When we had our model generate advice based on human players’ trajectories, models still learned better from this model-generated guidance than from human-written messages ( $\Delta$ (nAUC) $=0.21$ , $p<10^{-8}$ ). This suggests the asymmetry stems not from differences in knowledge, but from communication style. Human messages often included metacognitive strategies (“take time to look for safe patterns”), analogies (“orange is like terminator”), or emotional content (“[I] was left very confused”) — aspects that human learners readily use but our model finds harder to interpret. These differences highlight the potential and challenges for language-mediated human-machine collaborative learning. We show more examples of human- and model-generated messages in Appendix Section I and on our website.

Ablating language-guided proposals in our model leads to a significant performance drop compared to the full version ( $\Delta(\text{nAUC})=-0.058$ , $p=7.2\times 10^{-4}$ ). This ablated version still leverages language likelihood, which lets it outperform learning from experience alone ( $\Delta(\text{nAUC})=-0.030$ , $p=5.7\times 10^{-3}$ ), see Method Section 3.1 and Appendix Figure 8.

Variability across games: Our results reveal interesting patterns in the success and failures of linguistic guidance (Figure 5(b)). Models learning from other models show consistent benefits across all games because models generate more comprehensive messages and process advice more reliably than humans on average. In contrast, human learners show minimal improvements in games requiring rapid reactions and precise motor control (portals, plaqueAttack, aliens, missileCommand), where linguistic advice cannot substitute for motor practice. Models struggle to learn from humans in relational, where humans often provide imprecise descriptions of complex rule interactions, and plaqueAttack, where humans frequently omit important mechanics they discovered. In contrast, all learners benefit from both human and model advice in games like avoidGeorge, where the critical strategy is non-obvious, and beesAndBirds, jaws, and pushBoulders, where key dangers are memorable and straightforward to describe. These patterns suggest that social learning strategies like selecting pedagogical teachers or aggregating multiple sources of advice could mitigate some failure cases.

5.4 Generational learning

While the previous experiments allowed social learners to benefit from fully explored game mechanics, real-world learning often relies on partial and imperfect knowledge transmission. To investigate whether our model could replicate this gradual accumulation of knowledge, we designed an iterated learning experiment inspired by Tessler et al. (2021). In this setting, each agent interacts with the game environment for only two lives before generating advice to the next agent. This cycle continues across 10 generations for each of the 10 games, with performance tracked generation by generation.
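The transmission-chain protocol described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `run_chain`, `toy_play`, and `toy_write_advice` are hypothetical stand-ins, and the toy agent below replaces the actual model with a learner that discovers one rule per life.

```python
def run_chain(play, write_advice, n_generations=10, lives_per_agent=2):
    """Run one transmission chain; each agent sees only its predecessor's message."""
    message = None  # generation 1 learns from experience alone
    scores = []
    for _ in range(n_generations):
        knowledge, score = play(message, lives_per_agent)
        scores.append(score)
        message = write_advice(knowledge)  # passed on to the next generation
    return scores


# Toy stand-ins (hypothetical): knowledge is a set of discovered rules.
ALL_RULES = ["r%d" % i for i in range(6)]

def toy_play(message, lives):
    known = set(message.split()) if message else set()
    # Each life reveals one rule that is not yet known, in a fixed order.
    for rule in ALL_RULES:
        if lives == 0:
            break
        if rule not in known:
            known.add(rule)
            lives -= 1
    return known, len(known) / len(ALL_RULES)

def toy_write_advice(knowledge):
    return " ".join(sorted(knowledge))

scores = run_chain(toy_play, toy_write_advice)
print(scores)
```

In this toy chain the transmission is lossless, so performance can only increase. The regressions reported for some games arise precisely because real messages are partial or selective.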

Our results show that performance reliably increases across generations in all games where models do not already achieve mastery from generation 1 (preconditions and aliens) (Figure 6). Some games reached near-complete mastery by Generation 2, while others showed more gradual improvements. A fixed-effect model controlling for game variability confirmed this trend, showing significant improvements over the first generation for all others ( $\Delta(\text{nAUC})_{i>1}\in[0.44,0.57]$ , all $p<10^{-10}$ ).

However, in plaqueAttack and relational, we observe occasional regressions where later generations underperform earlier ones, highlighting the brittleness of single-teacher transmission. These regressions stem from structural properties of the games. Relational requires coordinating many interdependent transformation rules, and even when players know the correct rules, a single mis-push can irreversibly block progress within the two-life limit, making performance volatile across generations. PlaqueAttack involves fast-paced action with two different viable survival strategies (eliminating attackers or reconquering damaged bases). Messages can describe only one of these strategies, which can inadvertently steer later generations away from exploring the alternative, resulting in intermittent drops in performance. These task-specific constraints explain why cumulative improvement is less stable in these games than in others with more linguistically compressible mechanics. This phenomenon could be mitigated through the integration of multiple teachers or teacher selection (Kendal et al., 2018; Schultner et al., 2024). Together, these findings indicate that agents can build upon fragmented experiences to gradually refine their world models over generations, mirroring learning dynamics observed in human populations (Tessler et al., 2021).

6 Discussion

Our approach aligns with a longstanding research program in which human mental representations are modeled as structured, program-like generative theories, and learning is understood as probabilistic inference over these structures (Griffiths et al., 2010; Lake et al., 2017; Rule et al., 2020). This perspective has been highly successful in explaining human causal reasoning (Tenenbaum et al., 2006; Griffiths and Tenenbaum, 2009), intuitive physics (Battaglia et al., 2012; Smith et al., 2023), or social reasoning (Baker et al., 2011; Ying et al., 2024): people form hypotheses about latent mechanisms, simulate their consequences, and revise them in light of new evidence. Executable, program-like world models are a natural next step in this line of work (Tsividis et al., 2021), and our contribution can be seen as extending this framework to the domain of social learning by integrating linguistic guidance and direct experience within a unified inferential model.

Our experiments showed that our computational model reproduces key features of human social learning: advice reduces the attempts needed for success by shaping exploration, supports generational knowledge accumulation, and even allows model-generated guidance to help human learners—demonstrating bidirectional knowledge exchange.

While VGDL is only a coarse approximation of human game representations, our results show it captures enough structure to study social learning and afford bidirectional human–model knowledge transfer. Future work could leverage library-learning methods (Ellis et al., 2020; Wong et al., 2021) to model the emergence of shared representations through linguistic interaction, potentially driving representational convergence rather than requiring pre-aligned representations.

Our results show that linguistic guidance often speeds and safeguards exploration, but when advice is wrong it can restrict search — mirroring the “double-edged sword of pedagogy” observed in developmental psychology (Bonawitz et al., 2011). Humans counter this issue by evaluating testimony against prior causal theories before integrating it (Harris et al., 2018; Sobel and Kushnir, 2013). To match this sophistication, computational models will need meta-cognitive mechanisms to judge the reliability of advice and adjust their exploration accordingly.

This paper opens several avenues for future work. Human-generated messages often include game abstractions, high-level strategies, and planning heuristics that our model currently cannot leverage. Extending the framework to interpret and learn from these richer forms of guidance (e.g. through inference of auxiliary reward functions, planning abstractions and strategies) could unlock enhanced social learning capabilities (e.g., Silver et al., 2024). It would also be valuable to examine in more detail which linguistic abstractions, beyond rule and win/loss information, facilitate robust generational transfer, building on methodologies from prior human iterated-learning studies in VGDL environments (Tessler et al., 2021). Beyond passive learning, our model could be further extended to make decisions about whom to learn from based on perceived expertise or success, a capacity known as prestige-based social learning (Kendal et al., 2018; Schultner et al., 2024). Future work could also investigate how different LLM families behave within our framework in their dual roles as advice generators (speakers) and approximations of human speakers (speaker models). Because our method uses LLMs both to produce pedagogical messages and to evaluate the likelihood of human-written advice, comparing families with different inductive biases could reveal how model-specific language priors shape both message interpretation and learning outcomes.

From an artificial intelligence perspective, an important direction for future work is extending this framework to real-world, continuous, and more complex environments. Doing so will require advances in program synthesis, scalable probabilistic inference, and hardware capable of performing inference over rich, unstructured programming languages such as Python (Tang et al., 2024; Lehrach et al., 2025). This is an active and rapidly growing area of research (e.g., see a recent library scaling probabilistic programming with GPUs via JAX), with significant investment and recent progress in LLM-driven code generation and executable world modeling (Cusumano-Towner et al., 2019; Lew et al., 2023; Loula et al., 2025). Although this may seem challenging, humans routinely construct rich executable models in code (physics engines, video game environments, simulations of complex systems), which allow them to reason about highly complex processes, explore counterfactual scenarios, and deepen their understanding of the world. These practices illustrate the feasibility and potential benefits of scaling program-like world models to richer domains.

Finally, our results point to exciting possibilities for human-machine collaborative learning. Our model not only benefits from human-generated guidance but also contributes back through effective, pedagogical advice — closing the collaborative loop. This demonstrates a first step toward bidirectional learning systems capable of supporting human learners. The future directions outlined above — handling richer linguistic guidance, adaptive trust in information sources, and scaling to open-ended domains — would represent major steps toward AI systems that not only learn efficiently but also teach, collaborate, and adapt within complex social learning networks, augmenting collective intelligence in hybrid human–AI communities (Colas et al., 2022; Brinkmann et al., 2023; Collins et al., 2024).

Reproducibility statement

Section 3 details the full model specification and inference procedure. Section 4 describes the human and model experimental design. The appendices provide additional details including: the description of all VGDL primitives, the inference pseudo-code, details about guided proposals and planning, instructions used for human data collection and full prompts used by the LM components of our model. The codebase will be released at github.com/ccolas/language_and_experience. Together these resources allow full replication of the experiments.

Acknowledgements

Cédric Colas is partly funded by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 101065949. This project was supported by Intel and the National Science Foundation under grants CCF-2217064 and IIS-2212310. Research was additionally sponsored by the Department of the Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Department of the Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Author Contributions

CC led the development of the computational model, conducted simulation experiments, and wrote the initial draft of the paper. TM and BP led the design and implementation of the human experiments, including participant recruitment and data collection. CC, MHT, NG, JA, and JT contributed to the formulation of research questions, the high-level design of the computational framework, and editing the manuscript. All authors provided feedback and helped shape the final version of the paper.

Appendix contents

Appendix A Extended related work

Language and RL. Language has emerged as a powerful tool to enhance reinforcement learning (RL) by conveying state abstractions (Narasimhan et al., 2018), world dynamics (Zhong et al., 2020), auxiliary reward functions (Goyal et al., 2019), and task decompositions (Shridhar et al., 2021; Sharma et al., 2021). Its inherent compositionality and abstraction capabilities allow agents to pursue more abstract goals (Jiang et al., 2019), generalize effectively across diverse environments (Zhong et al., 2020) and goals (Hill et al., 2020; Colas et al., 2020), and structure long-term decision-making more effectively (Hu et al., 2019; Chen et al., 2021). Despite these advances, most language-based RL approaches rely on substantial amounts of paired data to ground linguistic information in agent experience, limiting their scalability and resemblance to human-like social learning. Language models (LMs) promise more flexible combination of language and decision-making (Ahn et al., 2022; Huang et al., 2022), yet they struggle to learn complex embodied skills that require low-level perception and temporally extended actions (Valmeekam et al., 2023; Paglieri et al., 2024). Nottingham et al. (2023) use LLMs to generate complete world model hypotheses that are then verified through experience. Our approach introduces a Bayesian framework that treats experience and language as two complementary sources of evidence in the inference of world models. This joint inference enables rapid adaptation to new linguistic inputs and tasks from the very first interaction, bypassing the need for extensive paired data.

Bayesian models of social cognition. Bayesian models of social cognition have proven to be powerful tools for modeling theory of mind (ToM), the ability to infer the hidden goals, beliefs, and intentions of others from observable behavior or linguistic cues (Baker et al., 2011; 2017; Frank and Goodman, 2012; Goodman and Frank, 2016). These models formalize social inference as an inverse planning problem, where observers assume that agents act approximately rationally towards their goals and use this assumption to infer likely mental states (Baker et al., 2011; 2017). Extensions of these frameworks to language understanding have resulted in the Rational Speech Acts model, which interprets communication as recursive social reasoning: listeners reason about what a speaker intends to convey based on the assumption that speakers choose utterances optimally, given their own beliefs (Frank and Goodman, 2012; Goodman and Frank, 2016). More recent work integrates Bayesian reasoning with modern machine learning to enable richer social inferences. For instance, Zhi-Xuan et al. (2023) leverage large language models (LLMs) as priors in a Bayesian goal inference system, allowing for the efficient suggestion and evaluation of likely goals in complex environments. Similarly, Ying et al. (2024) introduce a language-augmented ToM model that translates natural language statements about beliefs into formal epistemic representations, enhancing agents’ ability to reason about others’ knowledge and intentions. By combining linguistic input with structured probabilistic models, these approaches extend traditional ToM beyond purely behavioral cues, opening new avenues for interactive and socially aware AI agents (Vélez and Gweon, 2021). However, these models operate in settings where the transition dynamics are fully known and deterministic, and inference is restricted to identifying a latent goal or belief state from demonstrations. In contrast, our setting requires agents to infer the entire causal structure of each new environment (object types, interaction rules, and win/loss conditions) directly from experience, making joint inference from language and action substantially more challenging. Our work builds on these ideas by leveraging Bayesian inference to jointly interpret linguistic and experiential data, allowing agents to efficiently acquire world models from sparse social interactions.

Appendix B Learning from experience alone

Appendix C VGDL: Game primitives, state and action spaces

We work with a subset of the VGDL domain. Here are the possible types of avatars:

• MovingAvatar: controllable player that can move in the four directions with a certain speed based on keyboard presses
• ShootAvatar: MovingAvatar that can also shoot objects of type stype when the player presses the space bar
• FlakAvatar: ShootAvatar that can only move sideways and always shoots upwards.

Here are the possible object types and their parameters:

• Immovable: object that cannot move
• Flicker: object that disappears after total steps
• SpawnPoint: object that spawns objects of type stype with probability p
• ResourcePack: object that can be collected (see interactions addResource and removeResource)
• Passive: object that can be pushed (see interaction bounceForward)
• Missile: object that moves in one direction with a certain speed and an original orientation. It can change direction (see interactions turnAround and reverseDirection)
• Bomber: the combination of a missile and a spawner (with their combined parameters)
• Chaser: object that moves in the direction of the nearest target object of type stype, with a certain speed
• RandomNPC: object that moves in a random direction with a certain speed
• Portal: object that can teleport another object contacting it to another exit object (see interaction teleportTo).

All moving objects move every cooldown environment steps.

Interactions describe what happens when two objects contact. Here are the possible interaction types and their parameters:

• noInteraction: nothing happens
• killSprite: the second object kills the first object
• transformTo: the second object transforms the first object into a third object of type stype
• removeResource: the second object decreases the resource count of the first object
• addResource: the second object increases the resource count of the first object
• killIfHasLess: the second object kills the first if the first has fewer than 1 resource of type stype
• stepBack: objects step back (second steps back first, if not possible the second does)
• bounceForward: the second object pushes the first if possible (e.g., unless it is blocked)
• turnAround: the first object (a missile) moves one block down and switches direction when encountering the second object
• reverseDirection: the first object (a missile) reverses direction when encountering the second object

killSprite and transformTo interactions can further lead to a positive (+1), negative (-1) or null (0) reward.

Lose conditions can be:

• Timeout: the player loses if it runs out of time before solving the task
• CountIsZero(objs): the player loses if at least one of the objects objs has no remaining instances in the game.

Win conditions can be:

• Survive: the player wins if it survives long enough
• CountIsZero(objs): the player wins if the count of all objects objs goes to zero (e.g., they were all killed or disappeared)

Size of the search space and game difficulty.

The size of the search space is a direct function of the number of objects. The avatar must have one of 3 avatar types, each other object must have one of 10 object types, and each pair of objects must have one of 10 interaction types; win and lose conditions can apply to any list of objects, from none to all. Even without accounting for type parameters (e.g., chasing objects have a target, transform interactions transform objects into a specific type, etc.), game spaces are already on the order of $10^{30}$ for the smallest games involving 5 objects (missileCommand, preconditions). Other games might have up to 10 objects (plaqueAttack, portals). The difficulty of a game is not necessarily correlated with the size of the search space. For instance, some object types can be inferred quickly from movement observations, while others require direct interactions. The layout and game rules might also make exploration or planning more or less complex for different games.
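The order of magnitude quoted above can be checked with a short calculation. The exact counting convention is not spelled out in the text, so the sketch below makes its assumptions explicit: interactions are counted over ordered pairs of objects with self-pairs included, and win/lose conditions are optionally counted as arbitrary subsets of objects. The function name `log10_theory_count` is hypothetical.

```python
import math

def log10_theory_count(n_objects, n_avatar_types=3, n_object_types=10,
                       n_interaction_types=10, include_termination=False):
    """log10 of the number of parameter-free game descriptions (assumed counting)."""
    n_others = n_objects - 1            # every object except the avatar
    n_pairs = n_objects * n_objects     # ordered pairs, self-pairs included (assumption)
    total = (math.log10(n_avatar_types)
             + n_others * math.log10(n_object_types)
             + n_pairs * math.log10(n_interaction_types))
    if include_termination:
        # win and lose conditions each select any subset of the objects
        total += 2 * n_objects * math.log10(2)
    return total


print(round(log10_theory_count(5), 2))    # smallest games (5 objects)
print(round(log10_theory_count(10), 2))   # largest games (10 objects)
```

Under these assumptions a 5-object game yields roughly $3\times 10^{29}$ descriptions, in line with the order of $10^{30}$ stated above. Other conventions (unordered pairs, no self-pairs) shift the exponent by a few units but do not change the conclusion that exhaustive enumeration is infeasible.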

Perception and action. Players have five possible actions: four directional moves (arrow keys) and a shoot action (space bar) when the game includes a ShootAvatar or a FlakAvatar. Computational agents also have access to a NOOP action, while human players can simply choose not to act. The game runs at 20 frames per second. After taking a non-NOOP action, computational agents must wait for 4 environment steps before being able to take a new one, which caps them at 5 actions per second as a way of capturing human reaction time. Humans perceive the game visually, with the current reward displayed at the top (see Figure 2), whereas computational agents perceive symbolic states consisting of object properties (object id, color, position), rewards, and win/loss events.

Appendix D Inference pseudo-code

Algorithm 1 shows the pseudo-code of the inference algorithm, a particle filter with Metropolis rejuvenations, used to update the agent’s beliefs about possible games it might be playing, supported by experiential and linguistic evidence (Del Moral et al., 2006; Metropolis et al., 1953). The Perturb operator replaces one rule of the VGDL game description at a time, according to guided proposals (see Appendix Section E below). We picked the number of particles to be $N=20$ . This choice involves a trade-off:

• Exploration: too few theories lead to under-representation of uncertainty, reducing the model’s ability to identify informative subgoals and to explore sufficient alternative variations in the space of possible theories.
• Computation: each Metropolis update requires updating all particles and computing language likelihoods via the LLM, which is the computational bottleneck of the approach.

Empirically, using 10 particles was faster but sometimes failed to discover high-posterior theories quickly enough for the agent to survive, while 20 ensured a more robust and fast exploration of the theory space. More particles might slightly improve the performance of the inference, but would make the system slower to run and experiment with.
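To make the rejuvenation step concrete, here is a minimal sketch of Metropolis moves applied to a population of 20 particles. It is not the paper's Algorithm 1: reweighting and resampling are omitted, theories are reduced to tuples of integers, the proposal is uniform rather than guided, and `toy_log_post` is a hypothetical stand-in for the log posterior that would combine prior, experience likelihood, and language likelihood.

```python
import math
import random

def rejuvenate(particles, log_post, perturb, rng, n_steps=5):
    """Metropolis moves with a symmetric proposal, applied to every particle."""
    out = []
    for theory in particles:
        for _ in range(n_steps):
            candidate = perturb(theory, rng)
            log_ratio = log_post(candidate) - log_post(theory)
            # accept uphill moves always, downhill moves with prob. exp(log_ratio)
            if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
                theory = candidate
        out.append(theory)
    return out


# Toy problem (hypothetical): a "theory" is a tuple of 4 rules, each in 0..9.
TRUE_THEORY = (3, 1, 4, 1)

def toy_log_post(theory):
    # penalize each rule that disagrees with the hidden true theory
    return -2.0 * sum(a != b for a, b in zip(theory, TRUE_THEORY))

def toy_perturb(theory, rng):
    i = rng.randrange(len(theory))   # replace one rule at a time
    new = list(theory)
    new[i] = rng.randrange(10)       # uniform proposal, hence symmetric
    return tuple(new)

rng = random.Random(0)
particles = [tuple(rng.randrange(10) for _ in range(4)) for _ in range(20)]
for _ in range(30):
    particles = rejuvenate(particles, toy_log_post, toy_perturb, rng)
best = max(particles, key=toy_log_post)
print(best, toy_log_post(best))
```

The cost structure discussed above is visible here: every rejuvenation sweep calls `log_post` on each particle at every step, which in the full model includes an LLM-based language likelihood.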

Appendix E Guided proposals

We use biased proposals to initialize the set of 20 candidate theories and to generate rejuvenation moves (Perturb operator in Algorithm 1). These proposals are guided by both experience and linguistic evidence, accelerating convergence towards theories that better explain the agent’s observations.

Experience-driven proposals bias the sampling process in the following ways:

• objects observed to have moved cannot be assigned an object type incompatible with movement (e.g., Immovable or Flicker),
• objects moving in one direction are more likely to be assigned object types that move linearly (Missile and Bomber) than object types that allow movements in all directions (RandomNPC, Chaser),
• object pairs involved in collisions preceding observed rewards are more likely to be assigned reward-generating interactions.
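The first two biases above can be sketched as a reweighting of a uniform proposal over object types. This is an illustration only: the boost factor `linear_boost` and the function name are hypothetical, and only the two types explicitly named above (Immovable, Flicker) are treated as incompatible with movement.

```python
OBJECT_TYPES = ["Immovable", "Flicker", "SpawnPoint", "ResourcePack", "Passive",
                "Missile", "Bomber", "Chaser", "RandomNPC", "Portal"]
STATIC_TYPES = {"Immovable", "Flicker"}   # examples given in the text
LINEAR_TYPES = {"Missile", "Bomber"}      # types that move in a straight line

def object_type_weights(has_moved, single_direction, linear_boost=3.0):
    """Proposal probabilities over object types given movement observations."""
    weights = {t: 1.0 for t in OBJECT_TYPES}
    if has_moved:
        for t in STATIC_TYPES:
            weights[t] = 0.0              # ruled out by observed movement
        if single_direction:
            for t in LINEAR_TYPES:
                weights[t] *= linear_boost  # hypothetical boost factor
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


probs = object_type_weights(has_moved=True, single_direction=True)
print(sorted(probs.items(), key=lambda kv: -kv[1])[:3])
```

A similar reweighting over interaction types could implement the third bias, boosting reward-generating interactions for pairs whose collisions preceded a reward.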

Appendix F Details about planning

Our model strategically balances curiosity-driven exploration with goal-directed exploitation, mirroring human problem-solving strategies (Tsividis et al., 2021). From the current best theory $T_{\text{MAP}}$ , we generate candidate subgoals: specific collisions between pairs of objects that the agent can cause to occur by either touching another object itself, or by pushing or spawning an object onto it. Candidate subgoals are assigned values based on their exploration and exploitation potentials:

[TABLE]

The exploration value measures disagreement between theories in the current population:

[TABLE]

where $\text{count}(i\,|\,g)$ is the count of theories assigning interaction type $i$ to subgoal $g$ .
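One natural way to turn the counts $\text{count}(i\,|\,g)$ into a disagreement score is the Shannon entropy of the empirical distribution over interaction types. The sketch below uses that measure purely as an assumption for illustration; the function name `disagreement` and the input format are hypothetical.

```python
import math
from collections import Counter

def disagreement(interaction_by_theory):
    """Entropy (in nats) of the interaction types that theories assign to one subgoal.

    interaction_by_theory holds one interaction-type label per theory in the
    current population. Entropy is an assumed disagreement measure.
    """
    n = len(interaction_by_theory)
    counts = Counter(interaction_by_theory)   # count(i | g) for each type i
    return -sum((c / n) * math.log(c / n) for c in counts.values())


agree = ["killSprite"] * 20
split = ["killSprite"] * 10 + ["noInteraction"] * 10
print(disagreement(agree), disagreement(split))
```

With such a measure, subgoals on which all theories agree carry no exploration value, while subgoals that split the population evenly are the most informative to test.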

The exploitation value rewards key game mechanics:

[TABLE]

To achieve these goals, the model evolves short 10-step action sequences (e.g. move left, shoot, move up) through stochastic search, using $T_{\text{MAP}}$ to simulate outcomes. Action plans are iteratively mutated and refined over three generations using a simple genetic algorithm, where mutations crop and regrow action sequences from a uniformly sampled mid-point. Each sequence $a$ is evaluated according to:

[TABLE]

where $R_{\text{game}}$ is the game’s reward function under $T_{\text{MAP}}$ , $R_{\text{goal}}$ rewards progress toward the selected subgoal (between 0 and 1), and $R_{\text{win/loss}}$ provides +100 for winning and -100 for losing.
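The crop-and-regrow search described above can be sketched as follows. This is a simplified illustration, not the paper's planner: the population size, the number of plans kept per generation, and the `score` callable (which stands in for simulating a plan under $T_{\text{MAP}}$ and summing the reward terms) are hypothetical.

```python
import random

ACTIONS = ["left", "right", "up", "down", "shoot", "noop"]

def mutate(plan, rng, actions=ACTIONS):
    """Keep a prefix of the plan and regrow the rest with random actions."""
    cut = rng.randrange(1, len(plan))   # uniformly sampled mid-point
    return plan[:cut] + [rng.choice(actions) for _ in range(len(plan) - cut)]

def evolve(score, rng, plan_len=10, pop_size=20, n_generations=3, n_keep=5):
    """Evolve short action sequences; the best plans survive each generation."""
    population = [[rng.choice(ACTIONS) for _ in range(plan_len)]
                  for _ in range(pop_size)]
    for _ in range(n_generations):
        elite = sorted(population, key=score, reverse=True)[:n_keep]
        children = [mutate(rng.choice(elite), rng) for _ in range(pop_size - n_keep)]
        population = elite + children
    return max(population, key=score)


# Toy score (hypothetical): prefer plans that move right as often as possible.
plan = evolve(lambda p: p.count("right"), random.Random(0))
print(plan)
```

Because mutations preserve the start of a plan, good opening moves are retained while the tail of the sequence is re-explored.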

Additionally, the model performs ten 3-step lookaheads to avoid catastrophic errors, triggering replanning when the originally predicted value deviates significantly from the distribution of values for newly simulated trajectories:

[TABLE]

This safety mechanism prevents the agent from executing plans that appeared promising under limited simulation but fail under more extensive testing, or when unexpected changes in the environment render the original plan ineffective.
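A deviation test of this kind can be sketched with a simple z-score rule. The criterion, the threshold `z_threshold`, and the function name `should_replan` are assumptions made for illustration; they are not taken from the paper.

```python
import statistics

def should_replan(predicted_value, lookahead_values, z_threshold=2.0):
    """Flag a plan whose predicted value is an outlier among fresh lookaheads."""
    mean = statistics.fmean(lookahead_values)
    std = statistics.pstdev(lookahead_values)
    if std == 0.0:
        # all lookaheads agree: replan only if the prediction differs from them
        return predicted_value != mean
    return abs(predicted_value - mean) / std > z_threshold


# The plan predicted a value of 5, but ten fresh lookaheads all end in a loss.
print(should_replan(5.0, [-100.0] * 10))
# The prediction sits in the middle of the lookahead values: keep the plan.
print(should_replan(1.0, [0.0, 2.0] * 5))
```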

Appendix G Human data collection: instructions and participant recruitment

We recruited 122 adult English-speaking participants through Prolific to play 5 randomly-assigned games. To ensure task engagement while maintaining a representative sample, we excluded participants who failed to complete at least one level in $\geq 3$ games (final N=120). Participants’ median completion time was 49.30 minutes, and the median hourly pay rate was \$10.41/hr.

Before playing the games, participants read the following instructions:

Instructions for human data collection

When starting each game, participants in Condition 1 received no game-specific advice, as shown in Figure 9, while participants in Conditions 2 and 3 saw a message from a previous player, as shown in Figure 9. Players in Conditions 2 and 3 could still read the message as they played. After completing each game, participants in all conditions were asked to describe the game to a future player, as shown in Figure 9.

Appendix H Computational resources

The model requires a GPU to run the LM; we use one NVIDIA A100 (80 GB). We use prompt caching with the vLLM library to speed up the generation of proposals. Simulation run times vary with game complexity (slower with more objects) and the learning speed of the agent (runs end early when all levels are solved), ranging from an hour to a day depending on games and conditions.

Appendix I Example messages

Below are example messages generated by humans and model players in the social learning experiments.

Example of human messages in social learning experiment

Example of model messages in social learning experiment

Appendix J Prompts

Prompt for language likelihood estimation and language generation

Prompt for making rules proposals

Prompt for the pure-LLM baseline

Appendix K Baseline comparisons

Figure 10 shows the performance of the deep RL baseline when run for up to 2,000 episodes.

For completeness, we considered whether similar long-horizon runs should be performed for the LLM agent. This is not feasible for several reasons. First, the LLM agent is extremely expensive to execute: it produces a chain-of-thought plan at every time step, and concatenating these traces quickly saturates the context window, preventing the model from carrying information across many episodes. Second, longer training is unlikely to change outcomes. Inspection of the model’s reasoning traces shows persistent difficulties in (1) inferring rules from observations, (2) forming coherent multi-step plans, and (3) executing those plans reliably. These limitations mirror recent findings on LLM performance in long-horizon video-game benchmarks (e.g., Balrog AI (Paglieri et al., 2024)), where LLMs perform poorly even when given full rule descriptions. VGDL games are even more challenging, as each new game introduces novel rules that must be inferred from scratch. For these reasons, long-horizon rollouts are not expected to materially improve the LLM baseline.

Bibliography

1. M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, et al. (2022) Do as I can, not as I say: grounding language in robotic affordances. arXiv preprint arXiv:2204.01691. Cited by: Appendix A.
2. K. Allen, F. Brändle, M. Botvinick, J. E. Fan, S. J. Gershman, A. Gopnik, T. L. Griffiths, J. K. Hartshorne, T. U. Hauser, M. K. Ho, et al. (2024) Using games to understand the mind. Nature Human Behaviour, pp. 1–9. Cited by: §2.
3. C. L. Baker, J. Jara-Ettinger, R. Saxe, and J. B. Tenenbaum (2017) Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour 1 (4), pp. 0064. Cited by: Appendix A.
4. C. Baker, R. Saxe, and J. Tenenbaum (2011) Bayesian theory of mind: modeling joint belief-desire attribution. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 33. Cited by: Appendix A, §3.1, §6.
5. P. Battaglia, T. Ullman, J. Tenenbaum, A. Sanborn, K. Forbus, T. Gerstenberg, and D. Lagnado (2012) Computational models of intuitive physics. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 34. Cited by: §6.
6. E. Bonawitz, P. Shafto, H. Gweon, N. D. Goodman, E. Spelke, and L. Schulz (2011) The double-edged sword of pedagogy: instruction limits spontaneous exploration and discovery. Cognition 120 (3), pp. 322–330. Cited by: §6.
7. R. Boyd, P. J. Richerson, and J. Henrich (2011) The cultural niche: why social learning is essential for human adaptation. Proceedings of the National Academy of Sciences 108 (supplement_2), pp. 10918–10925. Cited by: §1.
8. L. Brinkmann, F. Baumann, J. Bonnefon, M. Derex, T. F. Müller, A. Nussberger, A. Czaplicka, A. Acerbi, T. L. Griffiths, J. Henrich, et al. (2023) Machine culture. Nature Human Behaviour 7 (11), pp. 1855–1868. Cited by: §6.