# Automated Radiological Report Generation from Breast Ultrasound Images Using Vision and Language Transformers

**Authors:** Shaheen Khatoon, Azhar Mahmood

PMC · DOI: 10.3390/jimaging12020068 · 2026-02-06

## TL;DR

This paper presents a multimodal Transformer framework that automatically generates radiology reports from breast ultrasound images, pairing a Vision Transformer (ViT) image encoder with pretrained language models (BERT, BioBERT, and GPT-2).

## Contribution

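The contribution is a multimodal Transformer framework for breast ultrasound report generation that uses cross-attention to fuse Vision Transformer (ViT) image features with text embeddings from pretrained general and biomedical language models (BERT, BioBERT, and GPT-2).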
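
As a rough illustration of the fusion step, the abstract describes a decoder that attends to visual features through cross-attention. The sketch below shows single-head scaled dot-product cross-attention in plain Python, where text-token states act as queries and image patch features act as keys and values. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy vectors, and the omission of learned projection matrices and multiple heads are all simplifications introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: text-token states, shape (n_tokens, d)
    keys:    image patch features, shape (n_patches, d)
    values:  image patch features, shape (n_patches, d_v)
    Returns (fused token states, attention weights).
    Learned Q/K/V projections are omitted here (an assumption for brevity).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    fused, weights = [], []
    for q in queries:
        # Similarity of this text token to every image patch.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted average of patch values: the visual context for this token.
        context = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        fused.append(context)
        weights.append(w)
    return fused, weights


# Hypothetical toy inputs: 3 image patches and 2 text tokens, dimension 2.
patch_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_states = [[1.0, 0.0], [0.0, 2.0]]
fused, attn = cross_attention(token_states, patch_feats, patch_feats)
```

Each row of `attn` is a probability distribution over image patches, so every generated token can draw on a different region of the ultrasound image. In a full model these weights would come from learned projections inside each decoder layer.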

## Key findings

- BioBERT-based models consistently achieve higher clinical specificity than their general-domain counterparts.
- GPT-2-based decoders enhance the fluency of generated reports.
- The proposed framework outperforms prior convolutional–recurrent architectures in report quality.

## Abstract

Breast ultrasound imaging is widely used for the detection and characterization of breast abnormalities; however, generating detailed and consistent radiological reports remains a labor-intensive and subjective process. Recent advances in deep learning have demonstrated the potential of automated report generation systems to support clinical workflows, yet most existing approaches focus on chest X-ray imaging and rely on convolutional–recurrent architectures with limited capacity to model long-range dependencies and complex clinical semantics. In this work, we propose a multimodal Transformer-based framework for automatic breast ultrasound report generation that integrates visual and textual information through cross-attention mechanisms. The proposed architecture employs a Vision Transformer (ViT) to extract rich spatial and morphological features from ultrasound images. For textual embedding, pretrained language models (BERT, BioBERT, and GPT-2) are implemented in various encoder–decoder configurations to leverage both general linguistic knowledge and domain-specific biomedical semantics. A multimodal Transformer decoder is implemented to autoregressively generate diagnostic reports by jointly attending to visual features and contextualized textual embeddings. We conducted an extensive quantitative evaluation using standard report generation metrics, including BLEU, ROUGE-L, METEOR, and CIDEr, to assess lexical accuracy, semantic alignment, and clinical relevance. Experimental results demonstrate that BioBERT-based models consistently outperform general domain counterparts in clinical specificity, while GPT-2-based decoders improve linguistic fluency.
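
To make one of the evaluation metrics named in the abstract concrete, the sketch below computes a sentence-level ROUGE-L score as the F-measure over the longest common subsequence (LCS) of candidate and reference tokens. It is a minimal illustration, not the evaluation code used in the paper: lowercased whitespace tokenization, the default `beta=1.0`, and the example report sentences are assumptions made here, and published toolkits may differ in tokenization and in the value of `beta`.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-measure.

    Tokenization is lowercased whitespace splitting (an assumption).
    beta weights recall against precision; beta=1.0 gives the plain F1.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Made-up example sentences, not taken from the paper's dataset.
reference = "oval hypoechoic mass with circumscribed margins"
candidate = "oval mass with circumscribed margins"
score = rouge_l(candidate, reference)
```

Here the candidate preserves the reference word order but drops one descriptor, so precision is 5/5 and recall is 5/6. Overlap metrics of this kind reward matching wording and are only a proxy for clinical correctness.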

## Linked entities

- **Diseases:** breast cancer (MONDO:0004989)

## Full-text entities

- **Genes:** VIT (vitrin) [NCBI Gene 5212] {aka VIT1}, ABL2 (ABL proto-oncogene 2, non-receptor tyrosine kinase) [NCBI Gene 27] {aka ABLL, ARG}, GPT2 (glutamic--pyruvic transaminase 2) [NCBI Gene 84706] {aka ALT2, GPT 2, MRT49, NEDSPM}
- **Diseases:** BrEaST (MESH:D061325), Breast cancer (MESH:D001943), cyst (MESH:D003560), intraductal papilloma (MESH:D018300), dysplasia (MESH:D015792), mammary duct ectasia (MESH:D004108), injury to (MESH:D014947), fatty (MESH:D008067), Cancer (MESH:D009369), lesion (MESH:D009059), fibroadenoma (MESH:D018226)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12941839/full.md

---
Source: https://tomesphere.com/paper/PMC12941839