TL;DR
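This paper shows that communication pressure among multiple agents can produce discrete, positionally disentangled, compositional representations of latent physical properties (elasticity, friction, mass ratio) from frozen video features, without property labels or supervision on message structure, enabling effective reasoning and planning.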
Contribution
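It introduces a multi-agent framework in which agents communicate through a Gumbel-Softmax bottleneck with iterated learning and develop property-specific protocols. Which properties can be communicated depends on the perceptual prior of the frozen backbone. The framework is validated on both synthetic and real video data.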
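To make the discrete bottleneck concrete, here is a minimal sketch of Gumbel-Softmax sampling for a single message symbol. It is an illustration under stated assumptions, not the paper's implementation: the function name `gumbel_softmax`, the temperature `tau`, and the `hard` flag are chosen here for illustration, and the code uses only the Python standard library, so it has no gradients. In an autodiff framework, the hard one-hot output would typically be paired with a straight-through estimator.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, hard=True, rng=random):
    """Sample a (relaxed) one-hot vector over symbols from unnormalized logits.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        # Clamp u away from 0 and 1 to avoid log(0).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)

    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    if not hard:
        return soft

    # Discretize: keep only the arg-max symbol as a one-hot vector.
    winner = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == winner else 0.0 for i in range(len(soft))]


if __name__ == "__main__":
    rng = random.Random(0)
    print(gumbel_softmax([1.0, 2.0, 0.5], tau=0.7, hard=False, rng=rng))
    print(gumbel_softmax([1.0, 2.0, 0.5], tau=0.7, hard=True, rng=rng))
```

Lower values of `tau` push the soft sample closer to one-hot, which is the usual lever for trading gradient quality against discreteness.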
Findings
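With 4 agents, all 80 seeds converge to near-perfect compositionality (PosDis=0.999, holdout 98.3%).
Causal interventions selectively disrupt the targeted physical property (~15% drop on the targeted property, <3% on others).
The pretrained backbone's perceptual prior determines which physical properties are communicable; for example, DINOv2 dominates on spatially visible ramp physics (98.3% vs 95.1%).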
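The compositionality finding is reported through positional disentanglement (PosDis). As a rough sketch of how such a score can be computed, the code below follows the commonly used definition: for each message position, take the gap between the highest and second-highest mutual information with any attribute, normalize by the entropy of that position, and average over non-constant positions. Whether the paper uses exactly this variant is an assumption, and the function names (`entropy`, `mutual_information`, `posdis`) are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in nats) of a list of hashable values."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def posdis(messages, attributes):
    """Positional disentanglement of messages with respect to attributes.

    messages:   list of equal-length tuples of symbols, one per sample.
    attributes: list of equal-length tuples of attribute values, one per sample.
    """
    n_positions = len(messages[0])
    n_attributes = len(attributes[0])
    scores = []
    for j in range(n_positions):
        symbols = [m[j] for m in messages]
        h = entropy(symbols)
        if h <= 1e-12:
            continue  # a constant position carries no information; skip it
        mis = sorted(
            (mutual_information(symbols, [a[i] for a in attributes])
             for i in range(n_attributes)),
            reverse=True,
        )
        runner_up = mis[1] if len(mis) > 1 else 0.0
        scores.append((mis[0] - runner_up) / h)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy data: three binary attributes, each copied into its own position.
    attrs = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    print(posdis(attrs, attrs))
```

In this toy case every position encodes exactly one attribute, so the score is at its maximum; a protocol that packs two attributes into a single symbol would score near zero for that position.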
Abstract
Can multi-agent communication pressure extract discrete, compositional representations of invisible physical properties from frozen video features? We show that agents communicating through a Gumbel-Softmax bottleneck with iterated learning develop positionally disentangled protocols for latent properties (elasticity, friction, mass ratio) without property labels or supervision on message structure. With 4 agents, 100% of 80 seeds converge to near-perfect compositionality (PosDis=0.999, holdout 98.3%). Controls confirm multi-agent structure -- not bandwidth or temporal coverage -- drives this effect. Causal intervention shows surgical property disruption (~15% drop on targeted property, <3% on others). A controlled backbone comparison reveals that the perceptual prior determines what is communicable: DINOv2 dominates on spatially-visible ramp physics (98.3% vs 95.1%), while V-JEPA 2…
