CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning
George Ibrahim, Rita Ramos, and Yova Kementchedjhieva

TL;DR
CONCAP is a multilingual image captioning model that leverages concept-aware retrieval to improve caption quality across languages, especially in low-resource settings, by reducing data needs and enhancing contextual grounding.
Contribution
The paper introduces CONCAP, a novel retrieval-augmented captioning approach that incorporates image-specific concepts to improve multilingual captioning performance.
Findings
CONCAP outperforms existing models on low-resource languages.
Concept-aware retrieval reduces data requirements significantly.
The model achieves strong results on the XM3600 dataset.
Abstract
Multilingual vision-language models have made significant strides in image captioning, yet they still lag behind their English counterparts due to limited multilingual training data and costly large-scale model parameterization. Retrieval-augmented generation (RAG) offers a promising alternative by conditioning caption generation on retrieved examples in the target language, reducing the need for extensive multilingual training. However, multilingual RAG captioning models often depend on retrieved captions translated from English, which can introduce mismatches and linguistic biases relative to the source language. We introduce CONCAP, a multilingual image captioning model that integrates retrieved captions with image-specific concepts, enhancing the contextualization of the input image and grounding the captioning process across different languages. Experiments on the XM3600 dataset…
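The abstract describes conditioning caption generation on two retrieved signals: captions in the target language and image-specific concepts. The sketch below illustrates that idea only; it is not the authors' implementation. The toy embeddings, the cosine-similarity retrieval, the prompt template, and the names `top_k` and `build_prompt` are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, items, k):
    """Return the texts of the k (text, vector) items most similar to the query."""
    ranked = sorted(items, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(image_vec, caption_store, concept_store, language,
                 k_caps=2, k_concepts=2):
    """Assemble a text prompt from retrieved captions and concepts.

    The template is hypothetical; the source does not specify one.
    """
    captions = top_k(image_vec, caption_store, k_caps)
    concepts = top_k(image_vec, concept_store, k_concepts)
    lines = [f"Similar images show: {c}" for c in captions]
    lines.append("Concepts in this image: " + ", ".join(concepts))
    lines.append(f"Caption in {language}:")
    return "\n".join(lines)

# Toy datastore: hand-written 3-d vectors stand in for real image/text embeddings.
caption_store = [
    ("un perro corre por la playa", [0.9, 0.1, 0.0]),
    ("un gato duerme en el sofá", [0.0, 0.9, 0.1]),
    ("dos perros juegan en la arena", [0.8, 0.0, 0.2]),
]
concept_store = [
    ("perro", [1.0, 0.0, 0.0]),
    ("gato", [0.0, 1.0, 0.0]),
    ("playa", [0.7, 0.0, 0.7]),
]

image_vec = [0.9, 0.0, 0.1]  # stand-in embedding of the input image
prompt = build_prompt(image_vec, caption_store, concept_store, "Spanish")
print(prompt)
```

In a setup like this, the prompt would be passed to a multilingual language model as the text prefix for generation. The concepts give the model image-specific terms alongside the retrieved captions, which may describe only similar images rather than the input image itself.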
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
