The BURCHAK corpus: a Challenge Data Set for Interactive Learning of Visually Grounded Word Meanings
Yanchao Yu, Arash Eshghi, Gregory Mills, Oliver Joseph Lemon

TL;DR
The paper introduces the BURCHAK corpus, a new human-human dialogue dataset for studying interactive learning of visually grounded word meanings, along with an n-gram user simulation framework and a reinforcement learning agent that learns such meanings through dialogue.
Contribution
The paper contributes a novel dialogue dataset, a generic n-gram framework for building user (i.e. tutor) simulations, and a demonstration of RL-based learning of visually grounded word meanings from naturalistic interactions.
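To make the idea of an n-gram user simulation concrete, here is a minimal sketch in Python. It is not the paper's framework: the dialogue-act labels (`inform`, `correct`, `ack`, `question`), the toy sequences, and the function names are all invented for illustration. The sketch assumes each tutor dialogue is reduced to a sequence of symbols and applies no smoothing.

```python
import random
from collections import Counter, defaultdict

def train_ngram(sequences, n=2):
    """Count which item follows each (n-1)-item context."""
    model = defaultdict(Counter)
    for seq in sequences:
        padded = ["<s>"] * (n - 1) + list(seq) + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            model[context][padded[i]] += 1
    return model

def sample_next(model, context, rng):
    """Sample the next item in proportion to its observed count."""
    counts = model.get(tuple(context))
    if not counts:
        return "</s>"  # unseen context: stop (this sketch has no smoothing)
    items = list(counts)
    weights = [counts[item] for item in items]
    return rng.choices(items, weights=weights, k=1)[0]

def simulate(model, n=2, max_len=20, seed=0):
    """Generate one simulated sequence from the trained model."""
    rng = random.Random(seed)
    history = ["<s>"] * (n - 1)
    output = []
    while len(output) < max_len:
        context = history[len(history) - (n - 1):] if n > 1 else []
        nxt = sample_next(model, context, rng)
        if nxt == "</s>":
            break
        output.append(nxt)
        history.append(nxt)
    return output

# Hypothetical tutor dialogue-act sequences (invented, not from the corpus).
toy_dialogues = [
    ["inform", "ack"],
    ["inform", "correct", "ack"],
    ["question", "inform", "ack"],
]

model = train_ngram(toy_dialogues, n=2)
simulated = simulate(model, n=2, seed=1)
```

Calling `simulate` with different seeds yields different act sequences drawn from the same learned distribution. A practical simulator would likely condition on more context and add smoothing for unseen contexts.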
Findings
User simulations achieve 78% turn match similarity.
RL agent performs comparably to rule-based systems.
Dataset captures natural dialogue phenomena.
Abstract
We motivate and describe a new freely available human-human dialogue dataset for interactive learning of visually grounded word meanings through ostensive definition by a tutor to a learner. The data has been collected using a novel, character-by-character variant of the DiET chat tool (Healey et al., 2003; Mills and Healey, submitted) with a novel task, where a learner needs to learn invented visual attribute words (such as "burchak" for square) from a tutor. As such, the text-based interactions closely resemble face-to-face conversation and thus contain many of the linguistic phenomena encountered in natural, spontaneous dialogue. These include self- and other-correction, mid-sentence continuations, interruptions, overlaps, fillers, and hedges. We also present a generic n-gram framework for building user (i.e. tutor) simulations from this type of incremental data, which is freely…
