Audio Description from Image by Modal Translation Network
Hailong Ning, Xiangtao Zheng, Yuan Yuan, Xiaoqiang Lu

TL;DR
This paper introduces the image-to-audio-description (I2AD) task, which generates audio descriptions from images to assist visually impaired people, and proposes a modal translation network (MT-Net) that learns cross-modal features and synthesizes the audio.
Contribution
It proposes the I2AD task, which the authors present as new, and the MT-Net model, whose three sub-networks handle feature learning, cross-modal mapping, and audio generation.
Findings
Generated audio is intelligible and accurate.
The method effectively bridges the gap between visual and auditory data.
Large-scale datasets with manual audio descriptions support the approach.
Abstract
Audio is the main way for visually impaired people to obtain information. Visual data are almost always available, but corresponding audio data are missing in many cases. To help visually impaired people better perceive the information around them, this paper proposes an image-to-audio-description (I2AD) task: generating audio descriptions from images. To tackle this new task, a modal translation network (MT-Net) from the visual to the auditory sense is proposed. MT-Net comprises three progressive sub-networks: 1) feature learning, 2) cross-modal mapping, and 3) audio generation. First, the feature learning sub-network learns semantic features from images and audio, covering both image feature learning and audio feature learning. Second, the cross-modal mapping sub-network transforms the image feature into a cross-modal representation with the same…
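To make the three-stage structure concrete, here is a minimal sketch of how an image could flow through feature learning, cross-modal mapping, and audio generation at inference time. It is not the authors' implementation: the class name ModalTranslationSketch, the dimensions, and the use of random linear maps in place of trained sub-networks are all assumptions made for illustration, and the audio side of feature learning (used during training) is left out.

```python
import random

# Hypothetical sizes; the abstract does not specify any of these.
IMAGE_DIM = 8     # length of a flattened toy "image"
FEATURE_DIM = 4   # length of the semantic feature vector
AUDIO_LEN = 16    # number of samples in the toy "waveform"


def make_matrix(rows, cols, rng):
    """Random weight matrix standing in for a trained sub-network."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(matrix, vector):
    """Matrix-vector product: one output value per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ModalTranslationSketch:
    """Applies the three stages in order (illustrative stand-in only)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # 1) feature learning (image side only in this sketch)
        self.image_encoder = make_matrix(FEATURE_DIM, IMAGE_DIM, rng)
        # 2) cross-modal mapping; keeping the same dimension is an assumption
        self.cross_modal_map = make_matrix(FEATURE_DIM, FEATURE_DIM, rng)
        # 3) audio generation
        self.audio_decoder = make_matrix(AUDIO_LEN, FEATURE_DIM, rng)

    def translate(self, image):
        """Map a flattened image to a toy waveform through the three stages."""
        image_feature = linear(self.image_encoder, image)
        cross_modal_feature = linear(self.cross_modal_map, image_feature)
        return linear(self.audio_decoder, cross_modal_feature)


if __name__ == "__main__":
    model = ModalTranslationSketch(seed=0)
    waveform = model.translate([0.5] * IMAGE_DIM)
    print(len(waveform))  # number of generated samples
```

The point of the sketch is only the ordering of the stages: the image is first reduced to a feature, that feature is moved into an audio-oriented space, and the result is decoded into samples. A real system would replace each random matrix with a trained network.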
