Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

Minsu Kim; Jeongsoo Choi; Soumi Maiti; Jeong Hun Yeo; Shinji Watanabe; Yong Man Ro

arXiv:2309.08531 · cs.CV · September 18, 2023

Open Access

TL;DR

This paper introduces a new image-to-speech captioning model that leverages vision-language pre-training and multi-modal tokens, achieving state-of-the-art results and high efficiency by discretizing speech and image data.

Contribution

It presents a novel approach combining vision-language pre-training with discretized speech and image units for efficient and accurate image-to-speech captioning.

Findings

01. Achieved state-of-the-art performance on the COCO and Flickr8k datasets.

02. Reduced data storage requirements for images by over 99%.

03. Demonstrated effective integration of pre-trained models into speech captioning.

Abstract

In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-supervised speech model. The speech units mainly contain linguistic information while suppressing other characteristics of speech. This allows us to incorporate the language modeling capability of the pre-trained vision-language model into the spoken language modeling of Im2Sp. With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performances on two widely used benchmark databases, COCO and Flickr8k. Then, we further improve the efficiency of the Im2Sp model. Similar…
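The abstract describes the model output as discretized speech units, i.e., quantized features of a self-supervised speech model. A minimal sketch of that quantization step follows, assuming nearest-centroid assignment against a pre-computed codebook (for example, one obtained by k-means clustering). The function name `quantize_to_units`, the toy centroids, and the toy 2-D features are hypothetical and chosen for illustration only; the abstract does not specify the clustering method or codebook size.

```python
from typing import List, Sequence


def quantize_to_units(
    features: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
) -> List[int]:
    """Map each feature frame to the index of its nearest centroid.

    Assumption: the codebook (centroids) was learned beforehand, e.g. by
    k-means on features from a self-supervised speech model.
    """
    units = []
    for frame in features:
        best_unit, best_dist = 0, float("inf")
        for idx, centre in enumerate(centroids):
            # Squared Euclidean distance between the frame and this centroid.
            dist = sum((f - c) ** 2 for f, c in zip(frame, centre))
            if dist < best_dist:
                best_unit, best_dist = idx, dist
        units.append(best_unit)
    return units


# Toy codebook with three 2-D centroids (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
# Toy "speech features": five frames of 2-D vectors (hypothetical values).
features = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.2], [3.8, 0.1], [4.1, -0.2]]

units = quantize_to_units(features, centroids)
print(units)  # [0, 0, 1, 2, 2]
```

Each frame becomes a single integer index, so a captioning model can treat speech as a token sequence in the same way a language model treats text tokens.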

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning