XferBench: a Data-Driven Benchmark for Emergent Language
Brendon Boldt, David Mortensen

TL;DR
This paper presents XferBench, a data-driven benchmark that evaluates emergent languages by their similarity to human language, using downstream NLP task performance as a measure of quality.
Contribution
The paper introduces a novel benchmark and Python package for assessing emergent language quality based on downstream NLP task performance.
Findings
The benchmark correlates well with human judgments of language quality.
Emergent languages show varying degrees of similarity to human language.
The package provides an easy-to-use tool for researchers to evaluate emergent languages.
Abstract
In this paper, we introduce a benchmark for evaluating the overall quality of emergent languages using data-driven methods. Specifically, we interpret the notion of the "quality" of an emergent language as its similarity to human language within a deep learning framework. We measure this by using the emergent language as pretraining data for downstream NLP tasks in human language: the better the downstream performance, the better the emergent language. We implement this benchmark as an easy-to-use Python package that only requires a text file of utterances from the emergent language to be evaluated. Finally, we empirically test the benchmark's validity using human, synthetic, and emergent language baselines.
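The abstract notes that the only required input is a text file of utterances. As a rough illustration of what preparing such a file might involve, the sketch below loads a corpus and prints basic statistics as a sanity check. It does not use the XferBench package or its API. The file layout (one utterance per line, whitespace-separated tokens) and the helper names `load_utterances` and `corpus_summary` are assumptions made for this example only.

```python
from collections import Counter
from pathlib import Path


def load_utterances(path):
    """Read a corpus file into a list of token lists.

    Assumed format (not confirmed by the source): one utterance per line,
    tokens separated by whitespace.
    """
    utterances = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        tokens = line.split()
        if tokens:  # skip blank lines
            utterances.append(tokens)
    return utterances


def corpus_summary(utterances):
    """Return basic corpus statistics for a quick sanity check."""
    counts = Counter(tok for utt in utterances for tok in utt)
    n_tokens = sum(counts.values())
    return {
        "n_utterances": len(utterances),
        "n_tokens": n_tokens,
        "vocab_size": len(counts),
        "mean_length": n_tokens / len(utterances) if utterances else 0.0,
    }


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python summarize_corpus.py my_emergent_language.txt
    if len(sys.argv) > 1:
        print(corpus_summary(load_utterances(sys.argv[1])))
```

A check like this can catch an empty file or a degenerate vocabulary before spending compute on pretraining; the format the package actually expects should be confirmed in its own documentation.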
Taxonomy
Topics: Scientific Computing and Data Management · Semantic Web and Ontologies · Computational Physics and Python Applications
