# Deep learning–driven image captioning: Progress through transformers and large language models

**Authors:** Priyanka Panchal, Vishal Polara, Siddaraj U, Abdullah Baz, Shobhit K. Patel

PMC · DOI: 10.1371/journal.pone.0345012 · PLOS One · 2026-03-16

## TL;DR

This paper introduces a deep learning model for image captioning that combines a vision transformer with a large language model and reports better scores than existing methods such as GIT, BLIP-2, and CoCa.

## Contribution

A novel vision transformer-based model with a unique cross-attention mechanism for improved image captioning.

## Key findings

- The proposed model outperforms existing methods such as GIT, BLIP-2, and CoCa on metrics including BLEU-4, METEOR, and CIDEr.
- On the MS COCO dataset, the model achieves scores of 0.495 (BLEU-4), 0.390 (METEOR), and 1.32 (CIDEr).
- The paper critically analyzes open challenges such as caption diversity, multimodal alignment, and bias mitigation.
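
To make the BLEU-4 figure above more concrete, the following is a minimal sketch of how an unsmoothed, sentence-level BLEU-4 score can be computed for a single caption. It is an illustration only: the function names `ngrams` and `bleu4` are hypothetical, tokenization is simple lowercase whitespace splitting, and the paper's reported scores presumably come from a standard corpus-level evaluation toolkit, so values from this sketch are not directly comparable to 0.495.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, references):
    """Unsmoothed sentence-level BLEU-4 (illustrative, not the paper's evaluation code)."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        if total == 0:
            # Candidate is too short to contain any n-grams of this order.
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            # Without smoothing, one zero precision makes the whole score zero.
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    # Geometric mean of the four clipped precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / 4)


if __name__ == "__main__":
    print(bleu4("a dog runs on the grass today", ["a dog runs on the grass"]))
```

A candidate that matches a reference exactly scores 1.0, and each extra or mismatched word lowers the clipped precisions. Corpus-level BLEU aggregates n-gram counts over all captions before taking the ratio, which is one reason sentence-level and reported scores differ.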

## Abstract

This paper presents a novel deep learning model for image captioning that combines an advanced vision transformer architecture with a powerful large language model (LLM). The proposed model shows a significant improvement over traditional CNN-RNN hybrids and existing transformer-based approaches by integrating a unique cross-attention mechanism that enables deep alignment between linguistic context and visual features. We demonstrate the superiority of the proposed architecture through extensive evaluation on the MS COCO, Flickr30K, and NoCaps datasets. The proposed model consistently outperforms leading methods such as GIT, BLIP-2, and CoCa across a comprehensive suite of metrics. On the MS COCO dataset, the proposed model achieves BLEU-4, METEOR, and CIDEr scores of 0.495, 0.390, and 1.32, respectively. We also critically analyze key challenges in the field, such as enhancing caption diversity, ensuring robust multimodal alignment, and mitigating inherent biases. By setting a new performance level, the proposed model serves as a reference point for the next generation of image captioning systems. The results demonstrate the effectiveness of our fusion strategy and support the development of models that produce more precise, contextually rich, and human-like image descriptions. This work supports SDG 9 (Industry, Innovation, and Infrastructure) by advancing multimodal AI systems, and SDG 4 (Quality Education) by enabling intelligent and accessible image understanding technologies.
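
The abstract does not specify the design of the cross-attention mechanism, so the following is only a generic sketch of standard single-head scaled dot-product cross-attention, in which caption-token vectors act as queries over image-patch vectors. It is not the authors' mechanism: the names `softmax` and `cross_attention` are hypothetical, learned projection matrices and multiple heads are omitted, and the toy vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_queries, image_keys, image_values):
    """Single-head scaled dot-product cross-attention (generic sketch).

    text_queries: one vector per caption token (the queries).
    image_keys, image_values: one vector per image patch.
    Returns (attended vectors, attention weights), one row per caption token.
    """
    key_dim = len(image_keys[0])
    value_dim = len(image_values[0])
    outputs, weights = [], []
    for query in text_queries:
        # Similarity of this token to every image patch, scaled by sqrt(d_k).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(key_dim)
            for key in image_keys
        ]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of the patch value vectors.
        outputs.append([
            sum(a * value[j] for a, value in zip(attn, image_values))
            for j in range(value_dim)
        ])
    return outputs, weights


if __name__ == "__main__":
    # Toy example: one caption token that aligns with the first of two patches.
    out, w = cross_attention([[1.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]])
    print(out, w)
```

Each row of the returned weights sums to 1 and shows how strongly a caption token attends to each image patch, which is the sense in which cross-attention aligns linguistic context with visual features.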

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12991260/full.md

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12991260/full.md

## References

48 references; full list in the complete paper: https://tomesphere.com/paper/PMC12991260/full.md

---
Source: https://tomesphere.com/paper/PMC12991260