Bypass Network for Semantics Driven Image Paragraph Captioning
Qi Zheng, Chaoyue Wang, Dadong Wang

TL;DR
This paper introduces a bypass network that separately models semantics and syntax to improve coherence and reduce repetition in image paragraph captioning, achieving superior results on benchmark datasets.
Contribution
The proposed model separates semantics and syntax modeling with a bypass network, enhancing coherence and reducing repetition in image paragraph captioning.
Findings
Outperforms state-of-the-art methods on benchmark datasets.
Effectively reduces both immediate and delayed repetitions.
Achieves higher coherence without sacrificing accuracy.
Abstract
Image paragraph captioning aims to describe a given image with a sequence of coherent sentences. Most existing methods model coherence through a topic transition that dynamically infers a topic vector from preceding sentences. However, these methods still suffer from immediate or delayed repetitions in generated paragraphs because (i) the entanglement of syntax and semantics distracts the topic vector from attending to pertinent visual regions; (ii) there are few constraints or rewards for learning long-range transitions. In this paper, we propose a bypass network that separately models the semantics and linguistic syntax of preceding sentences. Specifically, the proposed model consists of two main modules, i.e., a topic transition module and a sentence generation module. The former takes previous semantic vectors as queries and applies an attention mechanism to regional features to acquire…
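The abstract is cut off before it specifies the attention used in the topic transition module, so the following is only a minimal sketch of the general idea: a semantic vector from a preceding sentence acts as a query over regional image features. Scaled dot-product attention is an assumption made for illustration, not necessarily the paper's formulation, and the names `attend`, `query`, and `regions` are hypothetical.

```python
import math


def attend(query, regions):
    """Scaled dot-product attention of one query over region features.

    Assumed form for illustration only; the paper's exact attention may differ.
    query:   list of d floats (hypothetical semantic vector of a prior sentence)
    regions: list of k lists of d floats (hypothetical regional features)
    Returns (weights, context): k attention weights and a d-dim weighted sum.
    """
    d = len(query)
    # Similarity between the query and every region, scaled by sqrt(d).
    scores = [
        sum(q * r for q, r in zip(query, region)) / math.sqrt(d)
        for region in regions
    ]
    # Numerically stable softmax over the region scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the region features.
    context = [
        sum(w * region[j] for w, region in zip(weights, regions))
        for j in range(d)
    ]
    return weights, context


# Toy example: three 2-d regions, query aligned with the first one.
query = [1.0, 0.0]
regions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, context = attend(query, regions)
```

In this toy case the region most aligned with the query receives the largest weight, which is the behavior a topic vector would need in order to focus on pertinent visual regions.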
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: REINFORCE
