Audio Description from Image by Modal Translation Network

Hailong Ning; Xiangtao Zheng; Yuan Yuan; Xiaoqiang Lu

arXiv:2103.10018·cs.SD·March 19, 2021

TL;DR

This paper introduces the I2AD task, generating audio descriptions from images to assist visually impaired individuals, using a novel modal translation network that learns cross-modal features and synthesizes audio.

Contribution

The paper proposes the I2AD task, which it presents as new, together with MT-Net, a model with three sub-networks for feature learning, cross-modal mapping, and audio generation.

Findings

01. Generated audio is intelligible and accurate.

02. The method effectively bridges the gap between visual and auditory data.

03. Large-scale datasets with manual audio descriptions support the approach.

Abstract

Audio is the main way visually impaired people obtain information. In practice, visual data is widely available, but corresponding audio data often is not. To help visually impaired people better perceive their surroundings, this paper proposes an image-to-audio-description (I2AD) task: generating audio descriptions from images. To address this new task, a modal translation network (MT-Net) from the visual to the auditory sense is proposed. MT-Net comprises three progressive sub-networks: 1) feature learning, 2) cross-modal mapping, and 3) audio generation. First, the feature learning sub-network learns semantic features from image and audio, and comprises image feature learning and audio feature learning. Second, the cross-modal mapping sub-network transforms the image feature into a cross-modal representation with the same…
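
The three-stage data flow named in the abstract can be sketched roughly as below. This is a toy illustration only, not the authors' MT-Net: the class `ToyModalTranslator`, its method names, the dimensions, and the random linear maps standing in for trained sub-networks are all assumptions made for the example. It also covers only the image branch at inference time; the audio feature learning used during training is omitted.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def make_linear(in_dim: int, out_dim: int, rng: random.Random) -> Matrix:
    """Random (out_dim x in_dim) weights standing in for a trained layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ToyModalTranslator:
    """Hypothetical three-stage image-to-audio pipeline (not the paper's MT-Net)."""

    def __init__(self, image_dim: int, feature_dim: int, cross_dim: int,
                 audio_len: int, seed: int = 0) -> None:
        rng = random.Random(seed)  # fixed seed keeps the toy output reproducible
        self.image_encoder = make_linear(image_dim, feature_dim, rng)
        self.cross_modal_map = make_linear(feature_dim, cross_dim, rng)
        self.audio_decoder = make_linear(cross_dim, audio_len, rng)

    def learn_image_feature(self, image: Vector) -> Vector:
        """Stage 1 (image branch only): image -> semantic feature."""
        return apply_linear(self.image_encoder, image)

    def map_to_cross_modal(self, feature: Vector) -> Vector:
        """Stage 2: image feature -> cross-modal representation."""
        return apply_linear(self.cross_modal_map, feature)

    def generate_audio(self, representation: Vector) -> Vector:
        """Stage 3: cross-modal representation -> audio samples."""
        return apply_linear(self.audio_decoder, representation)

    def translate(self, image: Vector) -> Vector:
        """Run the three stages in order."""
        feature = self.learn_image_feature(image)
        representation = self.map_to_cross_modal(feature)
        return self.generate_audio(representation)


# Usage: a flattened 8-value "image" becomes a 16-sample "waveform".
model = ToyModalTranslator(image_dim=8, feature_dim=4, cross_dim=3, audio_len=16)
audio = model.translate([0.5] * 8)
print(len(audio))
```

Keeping each stage as a separate method mirrors the "progressive sub-networks" wording: in a real system each stage could be trained or swapped independently, while `translate` only fixes the order in which they run.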
