# Audio Caption in a Car Setting with a Sentence-Level Loss

**Authors:** Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu

arXiv: 1905.13448 · 2020-10-26

## TL;DR

This paper introduces a Mandarin-annotated audio captioning dataset recorded in a car scene and proposes a sentence-level loss used together with a GRU encoder-decoder model. The approach improves caption quality on the Car, Hospital, and Joint datasets, though human annotations still outperform the generated captions.

## Contribution

The paper presents a new Mandarin audio captioning dataset for car scenes and a sentence-level loss method that enhances caption quality and model generalization.

## Key findings

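- The sentence-level loss improves all reported metrics: classical NLG metrics, sentence richness, and human evaluation ratings
- Generated captions show higher semantic similarity to human annotations
- The model is evaluated on the new Car dataset, a previously published Mandarin Hospital dataset, and the Joint dataset, indicating generalization across scenes
- Human annotations still outperform model captions in many aspects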

## Abstract

Captioning has attracted much attention in image and video understanding, whereas little work has examined audio captioning. This paper contributes a Mandarin-annotated dataset for audio captioning within a car scene. A sentence-level loss is proposed for use in tandem with a GRU encoder-decoder model to generate captions with higher semantic similarity to human annotations. We evaluate the model on the newly proposed Car dataset, a previously published Mandarin Hospital dataset, and the Joint dataset, indicating its generalization capability across different scenes. An improvement in all metrics can be observed, including classical natural language generation (NLG) metrics, sentence richness, and human evaluation ratings. However, although detailed audio captions can now be generated automatically, human annotations still outperform model captions in many aspects.
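The abstract does not spell out the form of the sentence-level loss, so the following is only a minimal sketch of one plausible reading: a word-level cross-entropy term combined with a cosine distance between a pooled decoder representation and an embedding of the reference caption. The function names, the mean pooling, the cosine distance, and the `weight` parameter are all assumptions made for illustration and are not taken from the paper.

```python
import math

def token_cross_entropy(step_probs, target_ids):
    """Word-level loss: mean negative log-likelihood of the reference tokens."""
    total = sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))
    return -total / len(target_ids)

def mean_pool(vectors):
    """Average equal-length vectors into one sentence vector (assumed pooling)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def combined_loss(step_probs, target_ids, decoder_states,
                  reference_embedding, weight=1.0):
    """Hypothetical objective: word-level loss plus a weighted sentence-level term."""
    word_loss = token_cross_entropy(step_probs, target_ids)
    sentence_loss = cosine_distance(mean_pool(decoder_states), reference_embedding)
    return word_loss + weight * sentence_loss

# Toy example: one decoding step over a two-word vocabulary.
step_probs = [[0.5, 0.5]]
target_ids = [0]
decoder_states = [[1.0, 0.0], [1.0, 0.0]]
aligned = combined_loss(step_probs, target_ids, decoder_states, [1.0, 0.0])
misaligned = combined_loss(step_probs, target_ids, decoder_states, [0.0, 1.0])
```

In this toy setup the word-level term is identical in both calls, so `misaligned` exceeds `aligned` only because of the sentence-level term. That is the intended effect of such a loss: it penalizes captions whose overall meaning drifts from the reference even when individual token probabilities look the same.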

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.13448/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/1905.13448/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/1905.13448/full.md

---
Source: https://tomesphere.com/paper/1905.13448