# Crowdsourcing a Dataset of Audio Captions

**Authors:** Samuel Lipping, Konstantinos Drossos, Tuomas Virtanen

arXiv: 1907.09238 · 2019-07-23

## TL;DR

This paper introduces a three-step crowdsourcing framework for creating an audio captioning dataset (gather initial captions, edit them, then rate and keep the best) and evaluates it in terms of caption diversity and the number of typographical errors.

## Contribution

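The paper presents a three-step crowdsourcing methodology for audio captioning datasets, adapted from practices used to build image captioning and machine translation datasets: initial captions are gathered, each is grammatically corrected and/or rephrased, and the initial and edited captions are then rated so that only the top-rated ones are kept.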

## Key findings

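- The resulting dataset contains fewer typographical errors than the initially gathered captions.
- On average, the captions for each sound have a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions sharing four words with the same root.
- The captions are therefore dissimilar in wording while still conveying some of the same information.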
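To make the Jaccard figure concrete: two ten-word captions that share four words have an intersection of 4 and a union of 10 + 10 - 4 = 16, so their similarity is 4/16 = 0.25, close to the reported 0.24. The sketch below computes this on word sets. It is only an illustration under stated assumptions: the two captions are hypothetical, the tokenizer is a plain lowercase word split with no stemming (the paper compares words with the same root), and averaging over all caption pairs of a sound is an assumed way of aggregating, not necessarily the authors' procedure.

```python
import re
from itertools import combinations


def tokenize(caption):
    """Lowercase word set. Assumption: no stemming, unlike the paper's root-based matching."""
    return set(re.findall(r"[a-z']+", caption.lower()))


def jaccard(caption_a, caption_b):
    """Jaccard similarity |A & B| / |A | B| between the word sets of two captions."""
    words_a, words_b = tokenize(caption_a), tokenize(caption_b)
    union = words_a | words_b
    if not union:
        return 0.0  # two empty captions: define similarity as 0 to avoid dividing by zero
    return len(words_a & words_b) / len(union)


def mean_pairwise_jaccard(captions):
    """Mean Jaccard similarity over all caption pairs (assumed aggregation)."""
    pairs = list(combinations(captions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical ten-word captions sharing four words: heavy, rain, the, roof.
caption_a = "dog barks loudly while heavy rain falls on the roof"
caption_b = "heavy rain hits the roof as thunder rumbles far away"
similarity = jaccard(caption_a, caption_b)  # 4 shared / 16 distinct words
```

A value near 0.25 therefore means the captions overlap on a few content words but are otherwise worded differently.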

## Abstract

Audio captioning is a novel field of multi-modal translation: the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). Creating a dataset for this task requires a considerable amount of work, which makes crowdsourcing a very attractive option. In this paper we present a three-step framework for crowdsourcing an audio captioning dataset, based on concepts and practices followed in the creation of widely used image captioning and machine translation datasets. In the first step, initial captions are gathered. In the second step, a grammatically corrected and/or rephrased version of each initial caption is obtained. Finally, the initial and edited captions are rated, and the top ones are kept for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and the number of typographical errors in the obtained captions. The results show that the resulting dataset has fewer typographical errors than the initial captions, and that on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24. This is roughly equivalent to two ten-word captions having four words with the same root in common, indicating that the captions are dissimilar while still containing some of the same information.
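The final rating step of the framework can be sketched as a small selection routine. This is a minimal illustration, not the authors' implementation: the `Caption` class, the `select_top` helper, the numeric rating scale, the use of the mean score, and the number of captions kept are all assumptions introduced here, since the abstract only states that initial and edited captions are rated and the top ones are kept.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class Caption:
    """One caption for a sound (hypothetical structure)."""
    text: str
    stage: str  # "initial" (step 1) or "edited" (step 2)
    scores: List[float] = field(default_factory=list)  # ratings from step 3

    @property
    def mean_score(self) -> float:
        # Unrated captions get 0.0 so they rank last (assumption).
        return mean(self.scores) if self.scores else 0.0


def select_top(captions: List[Caption], keep: int) -> List[Caption]:
    """Rank initial and edited captions together and keep the best `keep`."""
    ranked = sorted(captions, key=lambda c: c.mean_score, reverse=True)
    return ranked[:keep]


# Hypothetical pool for one sound: two initial captions and one edited version.
pool = [
    Caption("peeple talking in big room", "initial", [2, 3]),
    Caption("People talking in a big room.", "edited", [4, 5]),
    Caption("A crowd chatters indoors.", "initial", [4, 4]),
]
kept = select_top(pool, keep=2)
kept_texts = [c.text for c in kept]
```

Ranking initial and edited captions in the same pool means an edit survives only if raters prefer it, which is one plausible way the rating step could filter out both typographical errors and poor rephrasings.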

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.09238/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1907.09238/full.md

## References

18 references — full list in the complete paper: https://tomesphere.com/paper/1907.09238/full.md

---
Source: https://tomesphere.com/paper/1907.09238