ELCC: the Emergent Language Corpus Collection
Brendon Boldt, David Mortensen

TL;DR
The paper introduces ELCC, a comprehensive collection of emergent language corpora from various systems, enabling broader analysis and comparison of emergent communication without the need for extensive system reimplementation.
Contribution
It provides a curated, annotated dataset of emergent language corpora from diverse environments, facilitating research and comparison in emergent communication.
Findings
Demonstrated the utility of ELCC through quantitative analyses.
Showcased potential for cross-system emergent language studies.
Enabled easier access for researchers without deep learning backgrounds.
Abstract
We introduce the Emergent Language Corpus Collection (ELCC): a collection of corpora generated from open source implementations of emergent communication systems across the literature. These systems include a variety of signalling game environments as well as more complex environments like a social deduction game and embodied navigation. Each corpus is annotated with metadata describing the characteristics of the source system as well as a suite of analyses of the corpus (e.g., size, entropy, average message length, performance as transfer learning data). Currently, research studying emergent languages requires directly running different systems which takes time away from actual analyses of such languages, makes studies which compare diverse emergent languages rare, and presents a barrier to entry for researchers without a background in deep learning. The availability of a substantial…
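The corpus-level analyses named in the abstract (size, entropy, average message length) can be illustrated with a minimal sketch. It assumes a corpus is held in memory as a list of messages, each a list of integer token IDs; that representation, the `corpus_stats` function, and the toy data are hypothetical illustrations, not the ELCC file format or the authors' analysis code. The entropy here is plain unigram Shannon entropy, which may differ from the exact definition used in the paper.

```python
import math
from collections import Counter


def corpus_stats(corpus):
    """Basic statistics for a corpus given as a list of token-ID lists.

    Hypothetical helper: the input format is an assumption, not ELCC's schema.
    """
    if not corpus:
        raise ValueError("corpus must contain at least one message")

    # Count every token occurrence across all messages.
    token_counts = Counter(tok for msg in corpus for tok in msg)
    total_tokens = sum(token_counts.values())

    # Shannon entropy (in bits) of the unigram token distribution.
    entropy = 0.0
    for count in token_counts.values():
        p = count / total_tokens
        entropy -= p * math.log2(p)

    return {
        "num_messages": len(corpus),
        "num_tokens": total_tokens,
        "vocab_size": len(token_counts),
        "avg_message_length": total_tokens / len(corpus),
        "unigram_entropy_bits": entropy,
    }


# Toy corpus: three messages over a three-symbol vocabulary (made-up data).
toy_corpus = [[3, 1, 1], [2, 1], [3, 3, 2]]
stats = corpus_stats(toy_corpus)
print(stats)
```

Computing such statistics from stored corpora, rather than rerunning each emergent communication system, is the kind of workflow the abstract describes.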
Peer Reviews
Decision: Submitted to ICLR 2025
1. The sources and statistics of the data are well documented.
2. The data can enable broader engagement and new research directions in ECS.
1. As the authors already mention in the limitations, the data are not annotated and carry no reference to the semantics of the communication. This limits the scope of possible analysis.
- Contribution to the reproducibility of the EC field and the ease of entry to the field.
- The novelty of the idea of creating corpora of emergent languages, which had not been conceptualized before, probably due to the artificial nature of emergent languages.
- Most EC papers mention previous work in the "Background" section or "Related Work" section regarding problem setting, methodology, and evaluation metrics. Interestingly, this paper additionally focuses on the quality and availability of
- ELCC is not annotated with meanings (or semantics, inputs) corresponding to messages (sentences). This would be somewhat problematic when EC researchers want to evaluate other properties, such as compositionality.
- The basic analyses provided (e.g., XferBench) are indeed valid in showing the "negative result" (i.e., that emergent language is not similar to human language). However, I am not sure if the provided metrics are still valid in showing the "positive result" in the future (i.e.,
There is a clear need to standardize various aspects of emergent communication (EC) experimental setups, and efforts in this direction such as the suggested ELCC, are valuable. The authors provide a compelling critique of the difficulties in reproducing past EC research and identify missing elements that hinder accelerated progress. They call for a more robust framework to facilitate easier reproducibility, comparison, and competition in the field, which could significantly accelerate its develo
The motivation behind this work is underdeveloped. Although standardization is beneficial, it is unclear what specific problems this work aims to solve. A clear example of ELCC’s utility, such as its potential to generate new insights, would enhance its justification. Additionally, the static nature of the ELCC collection raises questions about the kinds of research it enables and what insights it may provide beyond those already published by the corresponding papers. A leaderboard with well-def
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Focus
