Diverse and Styled Image Captioning Using SVD-Based Mixture of Recurrent Experts
Marzieh Heidari, Mehdi Ghatee, Ahmad Nickabadi, Arash Pourhasan Nezhad

TL;DR
This paper introduces MoRE, an image captioning model that uses an SVD-based mixture of recurrent experts to generate diverse, styled captions without needing styled datasets. The model is validated on the Microsoft COCO dataset.
Contribution
The paper presents a new captioning approach combining SVD with RNNs to enhance diversity and style without additional labeled data.
Findings
Generates diverse, styled captions without styled datasets
Achieves improved content accuracy in captions
Validated on Microsoft COCO dataset
Abstract
With great advances in vision and natural language processing, the generation of image captions has become a necessity. In a recent paper, Mathews, Xie and He [1] proposed a model that generates styled captions by separating semantics and style. Building on that work, a new captioning model is developed here, consisting of an image encoder that extracts features, a mixture of recurrent networks that maps the set of extracted features to a set of words, and a sentence generator that combines the obtained words into a stylized sentence. The resulting system, called Mixture of Recurrent Experts (MoRE), uses a new training algorithm that computes the singular value decomposition (SVD) of the weight matrices of the Recurrent Neural Networks (RNNs) to increase the diversity of captions. Each decomposition step depends on a distinct factor determined by the number of RNNs in MoRE. Since the used…
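The abstract is truncated and does not say how the SVD of the RNN weight matrices is used during training, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes, purely for illustration, that each expert receives a low-rank version of a shared weight matrix and that the number of retained singular values grows with the expert's index relative to the number of experts. The function name `expert_weight` and that rank schedule are hypothetical.

```python
import numpy as np


def expert_weight(weight, expert_index, num_experts):
    """Return a truncated-SVD copy of `weight` for one expert.

    ASSUMPTION (not from the paper): expert k keeps the top
    ceil(r * (k + 1) / num_experts) singular values, where r is the
    number of singular values of `weight`.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    r = s.shape[0]
    # Integer ceiling of r * (k + 1) / num_experts.
    keep = (r * (expert_index + 1) + num_experts - 1) // num_experts
    # Rebuild the matrix from the leading `keep` singular triplets.
    return (u[:, :keep] * s[:keep]) @ vt[:keep, :]


rng = np.random.default_rng(0)
base = rng.standard_normal((4, 4))  # stand-in for a recurrent weight matrix
num_experts = 4
experts = [expert_weight(base, k, num_experts) for k in range(num_experts)]
```

Under this assumed schedule each expert starts from a different low-rank view of the same matrix, and the last expert recovers the original weights. The actual factor used in MoRE may be defined quite differently.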
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
