UnMA-CapSumT: Unified and Multi-Head Attention-driven Caption Summarization Transformer
Dhruv Sharma, Chhavi Dhiman, Dinesh Kumar

TL;DR
This paper introduces UnMA-CapSumT, a transformer-based framework that integrates factual and stylized image captions to generate rich, coherent summaries, addressing out-of-vocabulary and repetition issues.
Contribution
It presents a novel unified transformer model that combines factual and stylized captioning methods for improved image description generation.
Findings
Outperforms existing models on Flickr8K and FlickrStyle10K datasets.
Effectively addresses out-of-vocabulary and repetition issues.
Demonstrates efficient learning of linguistic styles through extensive experiments.
Abstract
Image captioning, the generation of natural language descriptions of images, has gained immense popularity in recent years. Accordingly, various deep-learning techniques have been devised for factual and stylized image captioning models. Previous models focused on generating factual and stylized captions separately, producing more than one caption for a single image. The descriptions generated by these models suffer from out-of-vocabulary and repetition issues. To the best of our knowledge, no existing work provides a description that integrates different captioning methods to describe the contents of an image with factual and stylized (romantic and humorous) elements. To overcome these limitations, this paper presents a novel Unified Attention and Multi-Head Attention-driven Caption Summarization Transformer (UnMA-CapSumT) based Captioning…
Taxonomy
Topics: Video Analysis and Summarization · Music and Audio Processing · Cancer-related molecular mechanisms research
Methods: Linear Layer · Adam · Layer Normalization · fastText · Dropout · Position-Wise Feed-Forward Layer · Label Smoothing · Attention Is All You Need · Dense Connections · Byte Pair Encoding
