Visual Storytelling via Predicting Anchor Word Embeddings in the Stories
Bowen Zhang, Hexiang Hu, Fei Sha

TL;DR
This paper introduces a simple model for visual storytelling that predicts anchor word embeddings from images and combines them with image features to generate narratives. It attains the best results on most automatic evaluation metrics and outperforms competing methods in human evaluation.
Contribution
The paper presents a novel approach that predicts anchor word embeddings from images to improve visual storytelling, offering a simpler and more effective model than previous state-of-the-art methods.
Findings
Achieves the best results on most automatic evaluation metrics.
Outperforms competing methods in human evaluations.
Model is simple, easy to optimize, and effective.
Abstract
We propose a learning model for the task of visual storytelling. The main idea is to predict anchor word embeddings from the images and use the embeddings and the image features jointly to generate narrative sentences. We use the embeddings of randomly sampled nouns from the groundtruth stories as the target anchor word embeddings to learn the predictor. To narrate a sequence of images, we use the predicted anchor word embeddings and the image features as the joint input to a seq2seq model. As opposed to state-of-the-art methods, the proposed model is simple in design, easy to optimize, and attains the best results in most automatic evaluation metrics. In human evaluation, the method also outperforms competing methods.
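The training signal and joint input described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy embedding table, the squared-error loss, the concatenation used to form the joint input, and all function names are assumptions made for illustration. The abstract only states that randomly sampled nouns from the groundtruth stories supply the target embeddings and that predicted embeddings and image features are used jointly.

```python
import random

# Hypothetical toy embedding table (assumption); a real system would use
# pretrained word embeddings of much higher dimension.
EMBEDDINGS = {
    "beach": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.8, 0.2],
    "ball": [0.0, 0.3, 0.9],
}


def sample_anchor_target(story_nouns, rng):
    """Randomly pick one noun from a groundtruth sentence and return it with its embedding."""
    noun = rng.choice(story_nouns)
    return noun, EMBEDDINGS[noun]


def squared_error(predicted, target):
    """Assumed regression loss between predicted and target anchor embeddings."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def joint_input(image_feature, anchor_embedding):
    """Assumed fusion: concatenate the image feature with the anchor embedding."""
    return list(image_feature) + list(anchor_embedding)


rng = random.Random(0)
noun, target = sample_anchor_target(["beach", "dog", "ball"], rng)

predicted = [0.5, 0.5, 0.5]   # stand-in for the output of an image-to-embedding predictor
image_feature = [0.2, 0.4]    # stand-in for a visual feature vector

loss = squared_error(predicted, target)               # would drive training of the predictor
step_input = joint_input(image_feature, predicted)    # would be fed to the seq2seq model
```

At narration time no groundtruth nouns are available, so only the predicted embedding is combined with the image feature, as in the last line of the sketch.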
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Artificial Intelligence in Games
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
