UnMA-CapSumT: Unified and Multi-Head Attention-driven Caption Summarization Transformer

Dhruv Sharma; Chhavi Dhiman; Dinesh Kumar

arXiv:2412.11836 · cs.CV · December 17, 2024 · 2 cites

Open Access

TL;DR

This paper introduces UnMA-CapSumT, a transformer-based framework that integrates factual and stylized image captions to generate rich, coherent summaries while addressing out-of-vocabulary and repetition issues.

Contribution

It presents a novel unified transformer model that combines factual and stylized captioning methods for improved image description generation.

Findings

1. Outperforms existing models on Flickr8K and FlickrStyle10K datasets.
2. Effectively addresses out-of-vocabulary and repetition issues.
3. Demonstrates efficient learning of linguistic styles through extensive experiments.

Abstract

Image captioning, the generation of natural language descriptions of images, has gained immense popularity in the recent past. Accordingly, various deep-learning techniques have been devised for developing factual and stylized image captioning models. Previous models focused on generating factual and stylized captions separately, providing more than one caption for a single image. The descriptions generated by these models suffer from out-of-vocabulary and repetition issues. To the best of our knowledge, no existing work provides a description that integrates different captioning methods to describe the contents of an image with factual and stylized (romantic and humorous) elements. To overcome these limitations, this paper presents a novel Unified Attention and Multi-Head Attention-driven Caption Summarization Transformer (UnMA-CapSumT) based Captioning…

Taxonomy

Topics: Video Analysis and Summarization · Music and Audio Processing · Cancer-related molecular mechanisms research

Methods: Linear Layer · Adam · Layer Normalization · fastText · Dropout · Position-Wise Feed-Forward Layer · Label Smoothing · Attention Is All You Need · Dense Connections · Byte Pair Encoding