Learning what to say and how precisely: Efficient Communication via Differentiable Discrete Communication Learning
Aditya Kapoor, Yash Bhisikar, Benjamin Freed, Jan Peters, Mingfei Sun

TL;DR
This paper introduces an extension of Differentiable Discrete Communication Learning (DDCL) that supports unbounded signals, enabling efficient, adaptive communication in multi-agent reinforcement learning with reduced bandwidth and competitive performance.
Contribution
The paper extends DDCL to support unbounded signals, creating a universal layer for MARL that dynamically modulates message precision and reduces bandwidth without sacrificing performance.
Findings
Agents learn to adapt message precision based on task needs.
Bandwidth is reduced by over an order of magnitude while maintaining or improving performance.
A simple Transformer-based policy with DDCL matches complex architectures, questioning the need for specialized designs.
Abstract
Effective communication in multi-agent reinforcement learning (MARL) is critical for success but constrained by bandwidth, yet past approaches have been limited to complex gating mechanisms that only decide \textit{whether} to communicate, not \textit{how precisely}. Learning to optimize message precision at the bit-level is fundamentally harder, as the required discretization step breaks gradient flow. We address this by generalizing Differentiable Discrete Communication Learning (DDCL), a framework for end-to-end optimization of discrete messages. Our primary contribution is an extension of DDCL to support unbounded signals, transforming it into a universal, plug-and-play layer for any MARL architecture. We verify our approach with three key results. First, through a qualitative analysis in a controlled environment, we demonstrate \textit{how} agents learn to dynamically modulate…
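The gradient problem the abstract refers to can be seen in a few lines. The following is a minimal sketch, not code from the paper: `quantize`, `finite_difference`, and the grid spacing `delta` are hypothetical names chosen for illustration. It shows that a hard rounding step has zero slope almost everywhere, so no learning signal passes back through it.

```python
import numpy as np

def quantize(z, delta=0.5):
    """Hard discretization: snap z to the nearest multiple of delta.
    (Illustrative helper, not the paper's implementation.)"""
    return delta * np.round(z / delta)

def finite_difference(f, z, eps=1e-4):
    """Central finite-difference estimate of df/dz at z."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = 0.3  # a point away from any rounding boundary (boundaries sit at 0.25, 0.75, ...)

# Slope of the hard quantizer: both perturbed inputs round to the same grid point.
grad_hard = finite_difference(quantize, z)

# Slope of the identity map, for comparison: a continuous channel passes gradients through.
grad_identity = finite_difference(lambda x: x, z)

print("d quantize / dz :", grad_hard)
print("d identity / dz :", grad_identity)
```

Because the quantizer is piecewise constant, the sender's parameters receive no gradient from the receiver's loss unless the discretization step is replaced or reinterpreted in a differentiable way.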
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
- Originality: First, the generalization of the DDCL framework itself is a crucial technical step. Prior DDCL work was limited to positive, bounded signals, which imposed artificial architectural constraints (like requiring a sigmoid activation on policy outputs). This generalization to unbounded, signed, real-valued vectors removes those constraints, fundamentally changing DDCL from a niche technique to a universal, plug-and-play module for any MARL architecture. The authors back this up by pr
- Originality: The framework's core novelty is arguably limited to the boundary conditions of the DDCL formulation. The foundational innovation was established in prior work (Freed et al., 2020c). The generalization to unbounded signals is crucial for applicability, but the mechanism itself is inherited. A more significant constraint on the originality is the sub-optimality of the derived communication cost, $L_{comms}$. The paper acknowledges that $L_{comms}$ is a surrogate loss derived using J
1. The idea and the relaxation of the communication assumption are interesting.
2. The proof appears correct.
3. The analysis is extensive and quite interesting.
However, the paper has several shortcomings: 1) the core issues that arise when relaxing the assumption (see my detailed comments below) are neither considered nor addressed; 2) the theoretical analysis appears to be only a marginal modification of Freed's work [1,2]; 3) the claims in the experiments are too strong and are not sufficiently supported by the results. [1] Discrete communication learning via backpropagation on bandwidth-limited communication networks. Master’s Thesis, Carnegie Mellon University,
- **Clear conceptual improvement.** The paper makes a clean and well-motivated generalization of Freed et al.’s DDCL to unbounded real-valued signals, removing the restrictive assumption \(z \in [0,1]\). This substantially broadens applicability to a wide range of MARL architectures.
- **Strong experimental validation.** The evaluation suite is extensive, covering both controlled and high-dimensional domains. Integrations into four established MARL+Comms baselines are convincing
- **Clarity and presentation.** Figures are visually dense, and captions are occasionally unclear and confusing.
  - *Figure 1:* The caption is not well aligned with the figure. The term *“episodic plot”* is unclear, and it is not evident where “success rate remains perfect (1.0).”
  - *Figure 2:* The numerous STE baselines (`STE_[4,8,16]`) clutter the Pareto plots; consider reducing them or highlighting key configurations more clearly. The Pareto frontier is said to be “indicated by thick bl
1. Core Technical Contribution: A theoretically grounded and practically validated generalization of DDCL to unbounded signals.
2. Plug-and-Play Utility: Can be seamlessly integrated into multiple existing MARL algorithms without major architectural changes.
3. Clear Differentiable Objective: Well-justified communication cost term derived from information-theoretic principles.
4. Strong Empirical Results: Demonstrates bandwidth reduction of up to 10000× with competitive or superior task performance.
The paper assumes synchronized randomness between agents to reconstruct discrete messages. It would strengthen the work to analyze how DDCL performs under desynchronized or noisy communication conditions. While the current DDCL design uses a fixed uniform grid, future work could explore learned non-uniform quantization or variational encoding for further compression efficiency. The empirical evaluation could be further enhanced by testing DDCL under more dynamic or heterogeneous communication
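The reviewer's points about synchronized randomness and a fixed uniform grid can be made concrete with a small sketch. The code below is an illustration only: it assumes the shared randomness is used as a subtractive dither (the sender adds shared uniform noise before rounding, the receiver subtracts the same noise), which the excerpt above does not confirm. The names `send`, `receive`, `delta`, and `seed` are hypothetical.

```python
import numpy as np

def send(z, delta, rng):
    """Sender: scale to the grid, add shared dither, round to integer indices.
    Works for signed, unbounded z because no clipping or squashing is applied."""
    u = rng.uniform(-0.5, 0.5, size=z.shape)  # dither drawn from the shared stream
    return np.round(z / delta + u).astype(np.int64)

def receive(m, delta, rng):
    """Receiver: regenerate the same dither from the shared seed and subtract it."""
    u = rng.uniform(-0.5, 0.5, size=m.shape)
    return delta * (m - u)

seed = 0        # assumed to be agreed on by both agents in advance
delta = 0.25    # grid spacing: a smaller delta means higher precision and more bits
z = np.array([-3.7, 0.02, 12.5, -0.4])  # signed, unbounded message values

m = send(z, delta, np.random.default_rng(seed))         # discrete message on the wire
z_hat = receive(m, delta, np.random.default_rng(seed))  # reconstruction at the receiver
err = z_hat - z

print("integer message :", m)
print("reconstruction  :", z_hat)
print("max |error|     :", np.abs(err).max(), "(bounded by delta / 2 =", delta / 2, ")")
```

Under this assumption the reconstruction error never exceeds half the grid spacing, and it behaves like additive noise rather than a hard step. If the two agents' random streams fall out of sync, the subtracted dither no longer matches the added one and the bound loosens, which is the failure mode the reviewer asks the authors to analyze.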
Taxonomy
Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Neural Networks and Reservoir Computing
