Image Captioning using Multiple Transformers for Self-Attention Mechanism

Farrukh Olimov; Shikha Dubey; Labina Shrestha; Tran Trung Tin; Moongu Jeon

arXiv:2103.05103 · cs.CV · March 10, 2021

Open Access

TL;DR

This paper introduces MTSM, a novel image captioning method using multiple transformers to improve real-time captioning accuracy by capturing local and global object relationships.

Contribution

It proposes a new transformer-based framework that integrates region proposals and self-attention for enhanced image captioning performance.

Findings

01. Achieves improved captioning accuracy on the MSCOCO dataset.
02. Effectively models local and global object relationships.
03. Demonstrates competitive real-time captioning capabilities.

Abstract

Real-time image captioning, along with adequate precision, is the main challenge of this research field. The present work, Multiple Transformers for Self-Attention Mechanism (MTSM), utilizes multiple transformers to address these problems. The proposed algorithm, MTSM, acquires region proposals using a transformer detector (DETR). Consequently, MTSM achieves the self-attention mechanism by transferring these region proposals and their visual and geometrical features through another transformer and learns the objects' local and global interconnections. The qualitative and quantitative results of the proposed algorithm, MTSM, are shown on the MSCOCO dataset.
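
The second stage described in the abstract, passing region proposals with their visual and geometrical features through a transformer, can be illustrated with a minimal single-head self-attention sketch. This is not the authors' implementation: the boxes and visual features are random stand-ins for DETR output, the normalized (cx, cy, w, h) box encoding and the concatenation of visual and geometric features are assumptions, and the helper names (`box_geometry`, `self_attention`, `random_matrix`) are hypothetical.

```python
import math
import random


def box_geometry(box, img_w, img_h):
    """Normalized (cx, cy, w, h) for an (x1, y1, x2, y2) box; an assumed encoding."""
    x1, y1, x2, y2 = box
    return [
        (x1 + x2) / 2 / img_w,
        (y1 + y2) / 2 / img_h,
        (x2 - x1) / img_w,
        (y2 - y1) / img_h,
    ]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over region tokens."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    # Each row holds one region's attention over all regions.
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Random projection weights; a trained model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Stand-ins for detector output: three boxes and 8-dim visual features.
boxes = [(10, 20, 110, 220), (200, 50, 400, 300), (0, 0, 640, 480)]
visual = random_matrix(len(boxes), 8, rng)
# One token per region: visual features concatenated with box geometry.
tokens = [v + box_geometry(b, 640, 480) for v, b in zip(visual, boxes)]

d_in, d_model = len(tokens[0]), 6
w_q, w_k, w_v = (random_matrix(d_in, d_model, rng) for _ in range(3))
attended, weights = self_attention(tokens, w_q, w_k, w_v)
```

Because every region attends to every other region, including the full-image box, each output row mixes information from nearby and distant objects, which is one way to read the abstract's "local and global interconnections".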

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition