Image Captioning with Visual Object Representations Grounded in the Textual Modality
Dušan Variš, Katsuhito Sudoh, and Satoshi Nakamura

TL;DR
This paper explores a novel approach to image captioning by grounding visual object representations in the textual embedding space, aiming to improve training efficiency and semantic alignment.
Contribution
It introduces a method that grounds visual object representations in the captioning system's word embedding space, reversing the usual direction in which words or sentences are grounded in their associated images.
Findings
Grounded models reach training criteria faster, needing fewer updates.
Grounding improves the structural correlation between word embeddings and object vectors.
The approach maintains comparable captioning performance with enhanced training efficiency.
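One way to make "structural correlation" concrete is to compare the pairwise-similarity structure of the two spaces. The sketch below is an assumption for illustration, not the measure reported by the authors: it computes the Pearson correlation between the cosine-similarity matrices of class-label word embeddings and per-class object vectors. The function names and the row-per-class layout are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarities between the rows of x."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


def structural_correlation(word_vecs, object_vecs):
    """Pearson correlation between the similarity structures of two spaces.

    Assumption: row i of both arrays describes the same class label.
    The two spaces may have different dimensionality.
    """
    sim_w = cosine_similarity_matrix(word_vecs)
    sim_o = cosine_similarity_matrix(object_vecs)
    # Use only the strict upper triangle: each class pair once, no diagonal.
    iu = np.triu_indices(sim_w.shape[0], k=1)
    return float(np.corrcoef(sim_w[iu], sim_o[iu])[0, 1])
```

A value near 1 would indicate that classes which are close in the word embedding space are also close in the object vector space. At least three classes with non-constant pairwise similarities are needed for the correlation to be defined.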
Abstract
We present our work in progress exploring the possibilities of a shared embedding space between textual and visual modality. Leveraging the textual nature of object detection labels and the hypothetical expressiveness of extracted visual object representations, we propose an approach opposite to the current trend, grounding of the representations in the word embedding space of the captioning system instead of grounding words or sentences in their associated images. Based on the previous work, we apply additional grounding losses to the image captioning training objective aiming to force visual object representations to create more heterogeneous clusters based on their class label and copy a semantic structure of the word embedding space. In addition, we provide an analysis of the learned object vector space projection and its impact on the IC system performance. With only slight change…
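The abstract describes adding grounding losses to the captioning objective but does not spell them out here. As a minimal sketch under stated assumptions, the code below shows one hypothetical form such a term could take: a cosine-distance loss that pulls each projected object vector toward the word embedding of its detected class label, added to the captioning loss with a weight. The linear projection, the cosine distance, and the `weight` parameter are illustrative choices, not the authors' formulation.

```python
import numpy as np


def grounding_loss(object_feats, labels, projection, word_emb):
    """Mean cosine distance between projected objects and their label embeddings.

    object_feats: (n_objects, feat_dim) visual object representations
    labels:       (n_objects,) integer class indices into word_emb
    projection:   (feat_dim, emb_dim) hypothetical linear map into word space
    word_emb:     (n_classes, emb_dim) word embeddings of the class labels
    """
    projected = object_feats @ projection   # (n_objects, emb_dim)
    targets = word_emb[labels]              # (n_objects, emb_dim)
    num = np.sum(projected * targets, axis=1)
    den = np.linalg.norm(projected, axis=1) * np.linalg.norm(targets, axis=1)
    cos = num / np.clip(den, 1e-12, None)   # guard against zero-length vectors
    return float(np.mean(1.0 - cos))


def total_loss(caption_loss, ground_loss, weight=1.0):
    """Captioning loss plus a weighted grounding term (weight is an assumption)."""
    return caption_loss + weight * ground_loss
```

In a real training setup the same computation would be written in an autodiff framework so that gradients flow into the projection and the object encoder. NumPy is used here only to show the forward computation.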
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
