Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain
Lianying Chao, Kai Zhang, Haoran Cai, Sijie Wu, Xubin Li, Xin Chen

TL;DR
This paper introduces a domain-specific multi-modal large language model for ICT image captioning, trained via a multi-stage strategy, outperforming larger models in accuracy and BLEU scores.
Contribution
It proposes a novel multi-stage training approach for a domain-specific image captioning model in ICT, combining synthetic and expert-annotated data for improved performance.
Findings
DICModel outperforms state-of-the-art models with fewer parameters.
BLEU score increases by approximately 56.8% over comparable models.
Achieves higher accuracy than Qwen2.5-VL 32B on objective questions.
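For readers unfamiliar with the BLEU metric cited above, the following is a minimal sketch of sentence-level BLEU: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It is illustrative only. The summary does not state which BLEU variant, tokenizer, or smoothing the authors used, so the whitespace tokenization, the single reference, the absence of smoothing, and the function names here are all assumptions.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU for one candidate against one reference.

    Assumptions (not from the paper): whitespace tokenization,
    uniform weights over 1..max_n, and no smoothing, so any n-gram
    order with zero matches yields a score of 0.0.
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

A caption that matches its reference exactly scores 1.0, and a caption sharing no words with the reference scores 0.0. Published results are usually corpus-level and often smoothed, so scores from this sketch are not directly comparable with the figures reported for DICModel.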
Abstract
In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, this knowledge is hidden not only in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but cannot caption images. Multi-modal LLMs (MLLMs) can understand images, but they lack sufficient domain knowledge. To address these issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) in ICT, and constructs a standard evaluation system to validate the performance of DICModel. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used…
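The synthesis step mentioned in the abstract (Mermaid diagrams paired with text) can be pictured with a small sketch. This is not the authors' pipeline: the graph representation, the function names, the example network elements, and the template-based caption are all hypothetical, and the template stands in for the LLM that the paper uses to produce text.

```python
def build_mermaid(nodes, edges):
    """Emit Mermaid flowchart source from a small graph description.

    nodes: dict mapping node id -> display label
    edges: list of (source id, target id, edge label or None)
    """
    lines = ["flowchart TD"]
    for node_id, label in nodes.items():
        lines.append(f'    {node_id}["{label}"]')
    for src, dst, label in edges:
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {src} {arrow} {dst}")
    return "\n".join(lines)


def template_caption(nodes, edges):
    """Placeholder caption; an LLM would write this text in the paper's setup."""
    parts = []
    for src, dst, label in edges:
        via = f" via {label}" if label else ""
        parts.append(f"{nodes[src]} connects to {nodes[dst]}{via}")
    return "The diagram shows: " + "; ".join(parts) + "."


def synthesize_pair(nodes, edges):
    """Return one (diagram source, caption) pair for later rendering."""
    return {
        "mermaid": build_mermaid(nodes, edges),
        "caption": template_caption(nodes, edges),
    }


# Hypothetical example graph, chosen only for illustration.
example_nodes = {"UE": "User Equipment", "GNB": "gNodeB", "CORE": "5G Core"}
example_edges = [("UE", "GNB", "Uu"), ("GNB", "CORE", "NG")]
example_pair = synthesize_pair(example_nodes, example_edges)
```

The `mermaid` field would still need to be rendered to an image to form an image-text pair, for example with the Mermaid command-line tool; the abstract excerpt does not say which rendering toolchain the authors used.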
