# Long-text caption generation for surgical image with a concept retrieval augmented large multimodal model

**Authors:** Jiquan Liu, Yichen Zhu, Jingyi Feng, Xiaoyan Zhang, Ziyu Zhou, Ye Tao, Huilong Duan

PMC · DOI: 10.1371/journal.pone.0343823 · PLOS One · 2026-03-17

## TL;DR

This paper introduces a surgical concept retrieval-augmented multimodal model that generates long, detailed captions for surgical images and hallucinates fewer medical details than generic multimodal large language models (MLLMs).

## Contribution

A retrieval-augmented framework for long-text surgical image captioning, paired with an expert-verified benchmark that extends EndoVis2018 and an evaluation protocol based on clinically aligned metrics.

## Key findings

- A verified long-text surgical captioning benchmark was created from the EndoVis2018 dataset.
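- Retrieving surgical concepts (instruments, actions) mitigates the domain-specific hallucinations common in generic MLLMs, and the approach significantly outperforms baselines in producing clinically accurate, coherent long descriptions.
- N-gram metrics are inadequate for long medical text, so the evaluation protocol relies on clinically aligned metrics instead.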
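
The concern about n-gram metrics can be illustrated with a minimal sketch. It is not the paper's evaluation protocol: the function `ngram_precision` and the three example sentences are hypothetical, made up here only to show how surface overlap can reward a clinically wrong caption over a correct paraphrase.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Clipped n-gram precision of a candidate against one reference."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it appears in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical sentences, not taken from the paper's benchmark.
reference = "the bipolar forceps grasp the kidney tissue"
wrong_instrument = "the monopolar scissors grasp the kidney tissue"  # wrong tool
paraphrase = "kidney tissue is held by bipolar forceps"  # same meaning

score_wrong = ngram_precision(wrong_instrument, reference)
score_paraphrase = ngram_precision(paraphrase, reference)
print(f"wrong instrument: {score_wrong:.2f}, paraphrase: {score_paraphrase:.2f}")
```

In this toy case the caption naming the wrong instrument shares more bigrams with the reference than the faithful paraphrase does, which is the kind of mismatch that motivates clinically aligned metrics.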

## Abstract

Surgical image captioning is critical for automated reporting and education but is currently limited by a lack of long-text datasets and the tendency of generic Multimodal Large Language Models (MLLMs) to hallucinate medical details. To address this, we present a comprehensive framework for long-text surgical captioning. First, we construct a verified long-text benchmark extending the EndoVis2018 dataset, utilizing an automated pipeline with expert-in-the-loop validation to transform brief triplets into rich narratives. Second, we investigate domain-specific adaptation strategies for MLLMs. We implement a surgical concept retrieval-augmented generation (RAG) mechanism that dynamically injects specialized knowledge (instruments, actions) into the visual encoder, effectively mitigating domain-specific hallucinations common in generic models. Finally, recognizing the inadequacy of n-gram metrics for long medical text, we establish a robust evaluation protocol using clinically-aligned metrics. Extensive experiments demonstrate that our data-centric and retrieval-enhanced approach significantly outperforms baselines in producing clinically accurate, coherent long descriptions.
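
The retrieval step described in the abstract can be sketched as a nearest-neighbour lookup over a bank of surgical concepts. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `concept_bank`, `retrieve_concepts`, and `build_prompt`, and the three-dimensional embeddings, are hypothetical. The abstract says the retrieved knowledge is injected into the visual encoder; this summary does not describe how, so the sketch swaps in a simpler technique and prepends the concepts to a text prompt.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_concepts(query: list[float], bank: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k concept names whose embeddings are closest to the query."""
    ranked = sorted(bank.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


def build_prompt(concepts: list[str]) -> str:
    """Assumed injection strategy: prepend retrieved concepts to the text prompt."""
    return f"Relevant surgical concepts: {', '.join(concepts)}. Describe the image in detail."


# Toy embeddings, invented for illustration only.
concept_bank = {
    "bipolar forceps": [0.9, 0.1, 0.0],
    "monopolar curved scissors": [0.1, 0.9, 0.0],
    "grasping": [0.7, 0.0, 0.6],
    "cutting": [0.0, 0.8, 0.5],
}
image_embedding = [0.8, 0.05, 0.3]  # stand-in for a visual encoder output

top_concepts = retrieve_concepts(image_embedding, concept_bank, k=2)
prompt = build_prompt(top_concepts)
print(prompt)
```

Constraining generation to retrieved instrument and action names is one plausible way such a mechanism could discourage a model from naming tools that are not in the scene.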

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12995301/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12995301/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/PMC12995301/full.md

---
Source: https://tomesphere.com/paper/PMC12995301