Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain

Lianying Chao; Kai Zhang; Haoran Cai; Sijie Wu; Xubin Li; Xin Chen

arXiv:2601.09298·cs.CV·May 8, 2026

TL;DR

This paper introduces a domain-specific multi-modal large language model for ICT image captioning, trained via a multi-stage strategy, outperforming larger models in accuracy and BLEU scores.

Contribution

It proposes a novel multi-stage training approach for a domain-specific image captioning model in ICT, combining synthetic and expert-annotated data for improved performance.

Findings

01

DICModel outperforms state-of-the-art models with fewer parameters.

02

BLEU score increases by approximately 56.8% over comparable models.

03

Achieves higher accuracy than Qwen2.5-VL 32B on objective questions.

Abstract

In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, this knowledge resides not only in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but cannot caption images. Multi-modal LLMs (MLLMs) can understand images, but they lack sufficient domain knowledge. To address these issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) in ICT, and constructs a standard evaluation system to validate the performance of DICModel. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used…
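
The abstract mentions synthesizing image-text pairs with the Mermaid tool and LLMs but does not give the pipeline here. The following is only a minimal sketch of the general idea, not the authors' method: it builds Mermaid flowchart source from a small graph description and pairs it with a caption. `DiagramSpec`, `to_mermaid`, and `to_caption` are hypothetical names, the ICT example nodes are invented for illustration, and the template caption is a stand-in for the LLM-written text the paper refers to. Rendering the source to an image is left out and would need a separate Mermaid renderer.

```python
from dataclasses import dataclass


@dataclass
class DiagramSpec:
    """Hypothetical description of a small flowchart: nodes and directed edges."""
    nodes: dict[str, str]          # node id -> display label
    edges: list[tuple[str, str]]   # (source id, target id)


def to_mermaid(spec: DiagramSpec) -> str:
    """Emit Mermaid flowchart source (top-down layout) for the spec."""
    lines = ["flowchart TD"]
    for node_id, label in spec.nodes.items():
        lines.append(f'    {node_id}["{label}"]')
    for src, dst in spec.edges:
        lines.append(f"    {src} --> {dst}")
    return "\n".join(lines)


def to_caption(spec: DiagramSpec) -> str:
    """Template caption; an assumption standing in for an LLM-generated one."""
    steps = [f"{spec.nodes[s]} connects to {spec.nodes[d]}" for s, d in spec.edges]
    return "The diagram shows: " + "; ".join(steps) + "."


# Invented example graph, for illustration only.
spec = DiagramSpec(
    nodes={"UE": "User Equipment", "BS": "Base Station", "CN": "Core Network"},
    edges=[("UE", "BS"), ("BS", "CN")],
)

# One synthetic pair: diagram source (to be rendered to an image) plus its caption.
pair = {"mermaid": to_mermaid(spec), "caption": to_caption(spec)}
print(pair["mermaid"])
print(pair["caption"])
```

Because the diagram and the caption come from the same structured spec, the text is consistent with the image by construction, which is one plausible reason to prefer programmatic synthesis when expert annotation is scarce.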
