Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

Haonan Jia; Shichao Dong; Xin Dong; Zenghui Sun; Jin Wang; Jinsong Lan; Xiaoyong Zhu; Bo Zheng; Kaifu Zhang

arXiv:2603.01696 · cs.CV · March 3, 2026

TL;DR

This paper introduces Cross-modal Identity Mapping (CIM), a reinforcement learning framework that reduces information loss in image captioning by aligning visual content and text, improving caption quality without extra annotations.

Contribution

The paper proposes CIM, a novel reinforcement learning approach that minimizes information loss in image captioning through identity mapping, without requiring additional labeled data.

Findings

01. CIM outperforms supervised fine-tuning methods in image captioning.

02. On COCO-LN500, CIM achieves 20% better relation reasoning.

03. CIM enhances image captioning quality by focusing on image details.

Abstract

Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance.…
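
The retrieval-based intuition in the abstract can be illustrated with a small sketch. The abstract names the two signals (Gallery Representation Consistency and Query-gallery Image Relevance) but is truncated before defining them, so everything below is an assumption made for illustration: the use of cosine similarity over embedding vectors, top-k retrieval, the equal weighting `alpha`, and all function names are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve_top_k(caption_vec, gallery, k):
    """Text search stand-in: the k gallery image embeddings closest to the caption embedding."""
    ranked = sorted(gallery, key=lambda g: cosine(caption_vec, g), reverse=True)
    return ranked[:k]

def gallery_consistency(retrieved):
    """Assumed proxy for Gallery Representation Consistency:
    mean pairwise similarity among the retrieved images."""
    pairs = [(i, j) for i in range(len(retrieved)) for j in range(i + 1, len(retrieved))]
    if not pairs:
        return 0.0
    return sum(cosine(retrieved[i], retrieved[j]) for i, j in pairs) / len(pairs)

def query_relevance(query_img, retrieved):
    """Assumed proxy for Query-gallery Image Relevance:
    mean similarity between the captioned image and the retrieved images."""
    if not retrieved:
        return 0.0
    return sum(cosine(query_img, g) for g in retrieved) / len(retrieved)

def caption_reward(query_img, caption_vec, gallery, k=2, alpha=0.5):
    """Hypothetical scalar reward combining both signals (alpha is an assumed weight)."""
    retrieved = retrieve_top_k(caption_vec, gallery, k)
    return (alpha * gallery_consistency(retrieved)
            + (1.0 - alpha) * query_relevance(query_img, retrieved))

# Toy 2-D embeddings: two images of one scene type, two of another.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query_img = [1.0, 0.0]

precise_caption = [1.0, 0.0]  # embedding of a caption that pins down the scene
vague_caption = [1.0, 1.0]    # embedding of a caption that fits both scene types

good = caption_reward(query_img, precise_caption, gallery)
bad = caption_reward(query_img, vague_caption, gallery)
print(good > bad)
```

In this toy setup the precise caption retrieves two near-identical images that also match the captioned image, while the vague caption retrieves a mixed pair, so the precise caption scores higher. In a reinforcement learning loop, a score of this kind could serve as the reward for a generated caption without needing reference annotations, which is the property the abstract emphasizes.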

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis