# Insights into Object Semantics: Leveraging Transformer Networks for Advanced Image Captioning

**Authors:** Deema Abdal Hafeth, Stefanos Kollias

PMC · DOI: 10.3390/s24061796 · Sensors (Basel, Switzerland) · 2024-03-11

## TL;DR

This paper introduces a new image captioning model that uses Transformer networks and semantic object information to generate more accurate and diverse image descriptions.

## Contribution

The main contribution is the integration of instance-level semantic concepts for image objects into the encoder's attention mechanism, which enriches the visual feature representation.

## Key findings

- The model performs on par with state-of-the-art approaches on the MS-COCO dataset and the novel MACE dataset.
- Incorporating semantic object information improves caption accuracy and diversity.
- A Transformer decoder, used in place of the RNN-based decoders typical of earlier captioning models, handles language generation and contributes to the improved captions.

## Abstract

Image captioning is a technique used to generate descriptive captions for images. Typically, it involves employing a Convolutional Neural Network (CNN) as the encoder to extract visual features, and a decoder model, often based on Recurrent Neural Networks (RNNs), to generate the captions. Recently, the encoder–decoder architecture has witnessed the widespread adoption of the self-attention mechanism. However, this approach faces certain challenges that require further research. One such challenge is that the extracted visual features do not fully exploit the available image information, primarily due to the absence of semantic concepts. This limitation restricts the ability to fully comprehend the content depicted in the image. To address this issue, we present a new image-Transformer-based model boosted with image object semantic representation. Our model incorporates semantic representation in encoder attention, enhancing visual features by integrating instance-level concepts. Additionally, we employ Transformer as the decoder in the language generation module. By doing so, we achieve improved performance in generating accurate and diverse captions. We evaluated the performance of our model on the MS-COCO and novel MACE datasets. The results illustrate that our model aligns with state-of-the-art approaches in terms of caption generation.
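As a rough illustration of the idea in the abstract, the sketch below applies single-head scaled dot-product self-attention to image region features that have been enriched with object-level semantic embeddings. The abstract does not specify how the semantic representation is combined with the visual features, so the element-wise sum used here is an assumption, and every name (`semantic_self_attention`, `random_matrix`, the dimensions) is hypothetical rather than taken from the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_self_attention(visual, semantic, w_q, w_k, w_v):
    """Self-attention over region features fused with semantic embeddings.

    visual, semantic: n_regions x d_model matrices (one row per detected object).
    ASSUMPTION: fusion is an element-wise sum; the paper may use another scheme.
    """
    fused = [[v + s for v, s in zip(v_row, s_row)]
             for v_row, s_row in zip(visual, semantic)]
    q, k, v = matmul(fused, w_q), matmul(fused, w_k), matmul(fused, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned parameters or extracted features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_regions, d_model = 4, 8
visual = random_matrix(n_regions, d_model, rng)    # e.g. CNN region features
semantic = random_matrix(n_regions, d_model, rng)  # e.g. object class embeddings
w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))

out, weights = semantic_self_attention(visual, semantic, w_q, w_k, w_v)
print(len(out), len(out[0]))  # one enriched feature vector per region
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors of all regions. This is how attention lets one object's representation draw on the other objects in the image.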

## Full-text entities

- **Diseases:** injury to people or property (MESH:C000719191), visual impairments (MESH:D014786), MACE (MESH:D010033)
- **Chemicals:** lithium-ion (-)
- **Species:** Chryseobacterium sp. AR (species) [taxon 1637707], Canis lupus familiaris (dog, subspecies) [taxon 9615], Felis catus (cat, species) [taxon 9685], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC10975165/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC10975165/full.md

## References

58 references — full list in the complete paper: https://tomesphere.com/paper/PMC10975165/full.md

---
Source: https://tomesphere.com/paper/PMC10975165