Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning

Dandan Guo; Ruiying Lu; Bo Chen; Zequn Zeng; Mingyuan Zhou

arXiv:2105.04143·cs.CV·July 27, 2022

TL;DR

This paper introduces a hierarchical-topic-guided framework for image paragraph captioning that integrates semantic topics with visual features to produce coherent, diverse, and interpretable descriptions, outperforming many existing methods.

Contribution

It develops a plug-and-play model combining deep topic modeling with visual extraction and language generation, enabling semantic coherence and interpretability in image paragraph captioning.

Findings

01. Competitive performance on public datasets
02. Ability to distill interpretable semantic topics
03. Generation of diverse, coherent captions

Abstract

Observing a set of images and their corresponding paragraph-captions, a challenging task is to learn how to produce a semantically coherent paragraph to describe the visual content of an image. Inspired by recent successes in integrating semantic topics into this task, this paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework, which couples a visual extractor with a deep topic model to guide the learning of a language model. To capture the correlations between the image and text at multiple levels of abstraction and learn the semantic topics from images, we design a variational inference network to build the mapping from image features to textual captions. To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model, including Long Short-Term Memory (LSTM) and Transformer, and…
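The coupling described above (image features mapped to topics at several levels, then fed to the language model together with the visual features) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the variational family, layer sizes, or how topics enter the decoder, so the code assumes a Gaussian reparameterization followed by a softmax and simple concatenation. All names (`build_layers`, `infer_topics`, `build_decoder_input`) are hypothetical.

```python
import math
import random


def init_layer(rng, n_in, n_out):
    """Randomly initialised dense layer: (weight rows, bias)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply a dense layer to the vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softplus(z):
    """Numerically stable log(1 + exp(z)); keeps standard deviations positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def build_layers(rng, feat_dim, topic_sizes, hidden=8):
    """One (hidden, mean, scale) triple per topic layer, bottom to top."""
    layers, n_in = [], feat_dim
    for k in topic_sizes:
        layers.append((init_layer(rng, n_in, hidden),
                       init_layer(rng, hidden, k),
                       init_layer(rng, hidden, k)))
        n_in = hidden
    return layers


def infer_topics(image_feat, layers, rng):
    """Map image features to topic proportions at each level of abstraction.

    ASSUMPTION: Gaussian reparameterisation + softmax; the abstract does not
    specify the distribution used by the variational inference network.
    """
    thetas, h = [], image_feat
    for hidden_layer, mean_layer, scale_layer in layers:
        h = [math.tanh(v) for v in linear(h, hidden_layer)]
        mu = linear(h, mean_layer)
        sigma = [softplus(v) for v in linear(h, scale_layer)]
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        peak = max(z)
        exps = [math.exp(v - peak) for v in z]
        total = sum(exps)
        thetas.append([v / total for v in exps])
    return thetas


def build_decoder_input(image_feat, thetas):
    """ASSUMPTION: topics condition the language model by concatenation."""
    conditioned = list(image_feat)
    for theta in thetas:
        conditioned.extend(theta)
    return conditioned


rng = random.Random(0)
image_feat = [rng.random() for _ in range(16)]        # stand-in for extractor output
layers = build_layers(rng, feat_dim=16, topic_sizes=[6, 4, 2])
thetas = infer_topics(image_feat, layers, rng)
decoder_input = build_decoder_input(image_feat, thetas)
```

In this sketch `decoder_input` is the vector an LSTM or Transformer decoder would receive at each step. The shrinking topic sizes (6, 4, 2) are arbitrary and only illustrate that higher layers are meant to capture coarser semantics.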

Code & Models

Repositories

dandanguo1993/vtcm-based-image-paragraph-caption (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Variational Inference · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Softmax · Layer Normalization · Label Smoothing