Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval

Nguyen Lam Phu Quy; Pham Phu Hoa; Tran Chi Nguyen; Dao Sy Duy Minh; Nguyen Hoang Minh Ngoc; Huynh Trung Kiet

arXiv:2512.20042 · cs.CV · February 3, 2026

TL;DR

This paper introduces a multimodal pipeline that enhances image captions with external knowledge, providing richer, context-aware descriptions for real-world applications like journalism and digital archives.

Contribution

It presents a novel system combining image retrieval, geometric reranking, semantic search, and large language models to generate more informative image captions.

Findings

1. Generated captions are significantly more informative than those produced by traditional captioning methods.

2. The approach effectively incorporates external textual knowledge into image captioning.

3. The system shows strong potential for applications requiring deep visual-textual understanding.

Abstract

Real-world image captions often lack contextual depth, omitting crucial details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives, where richer, more informative descriptions are essential. To address this, we propose a multimodal pipeline that augments visual input with external textual knowledge. Our system retrieves semantically similar images using BEiT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them using ORB and SIFT for geometric alignment, and extracts contextual information from related articles via semantic search. A Qwen3 model fine-tuned with QLoRA then integrates this context with base captions generated by InstructBLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions.…
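The retrieve, rerank, and prompt-assembly flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are toy vectors standing in for BEiT-3 or SigLIP features, the inlier counts are made-up numbers standing in for ORB/SIFT keypoint matching, and every function name, variable, and prompt string here is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query_vec, index, k):
    """Stage 1: return the ids of the k indexed images most similar to the query.

    `index` maps image id -> embedding. In the paper these embeddings would
    come from a vision-language encoder; here they are toy 2-D vectors.
    """
    ranked = sorted(index, key=lambda i: cosine(query_vec, index[i]), reverse=True)
    return ranked[:k]

def rerank(candidates, inlier_counts):
    """Stage 2: reorder candidates by a geometric-consistency score.

    `inlier_counts` is an assumed stand-in for the number of keypoint
    matches a local-feature matcher would report for each candidate.
    """
    return sorted(candidates, key=lambda i: inlier_counts.get(i, 0), reverse=True)

def build_prompt(base_caption, snippets):
    """Stage 3: combine the base caption with retrieved article text.

    The prompt wording is illustrative only.
    """
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Base caption: {base_caption}\nContext:\n{context}\nEnriched caption:"

# Toy data (all values invented for illustration).
index = {"img_a": [1.0, 0.0], "img_b": [0.7, 0.7], "img_c": [0.0, 1.0]}
query = [1.0, 0.05]
inlier_counts = {"img_a": 12, "img_b": 40}
articles = {"img_a": "Article text linked to image A.",
            "img_b": "Article text linked to image B."}

candidates = retrieve(query, index, k=2)
reranked = rerank(candidates, inlier_counts)
prompt = build_prompt("A crowd gathers in a city square.",
                      [articles[i] for i in reranked])
```

In this toy run the embedding stage prefers `img_a`, while the geometric stage promotes `img_b` because of its higher match count, which is the kind of correction a reranking step is meant to provide. The resulting `prompt` would then be passed to the caption-enrichment language model.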

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques