CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network
Yuxin Peng, Jinwei Qi, Xin Huang, Yuxin Yuan

TL;DR
This paper introduces CCL, a hierarchical network that improves cross-modal retrieval by modeling intra-modality and inter-modality correlations at multiple granularities within a multi-stage, multi-task learning framework. It reports the best performance against 13 state-of-the-art methods on 6 datasets.
Contribution
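The paper proposes a hierarchical network with multi-grained fusion and multi-level association. It targets three limitations of existing two-stage methods: the first stage models only intra-modality correlation, the second stage relies on shallow networks with single-loss regularization, and the fine-grained clues carried by patches are not used.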
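To make the idea of multi-grained fusion concrete, here is a minimal sketch that combines a coarse, instance-level vector with a fine-grained vector pooled from patches. The fusion rule (mean pooling followed by concatenation) and the names `mean_pool` and `fuse_multi_grained` are assumptions chosen for illustration; this summary does not state how CCL performs the fusion, and the paper's hierarchical network is likely more involved.

```python
from typing import List

Vector = List[float]


def mean_pool(patches: List[Vector]) -> Vector:
    """Average a list of patch vectors into one fine-grained vector."""
    if not patches:
        raise ValueError("need at least one patch vector")
    dim = len(patches[0])
    return [sum(p[i] for p in patches) / len(patches) for i in range(dim)]


def fuse_multi_grained(instance_vec: Vector, patch_vecs: List[Vector]) -> Vector:
    """Hypothetical fusion: concatenate the coarse (instance) vector
    with the mean-pooled fine-grained (patch) vector."""
    return instance_vec + mean_pool(patch_vecs)


# Toy data: one whole-image vector and two patch vectors (made-up values).
image_vec = [0.2, 0.8]
patch_vecs = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse_multi_grained(image_vec, patch_vecs)
print(fused)  # [0.2, 0.8, 0.5, 0.5]
```

Concatenation keeps both granularities visible to later layers; a learned fusion layer would be the natural next step but is outside what this summary describes.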
Findings
Achieves the best performance on 6 datasets in comparison with 13 state-of-the-art methods.
Effectively models intra- and inter-modality correlations with multi-grained fusion.
Utilizes multi-task learning to balance semantic and similarity constraints.
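The balance between semantic and similarity constraints mentioned above can be illustrated with a small sketch. It assumes a cross-entropy term for the semantic (category) constraint, a contrastive term for the pairwise similarity constraint, and a hypothetical `weight` parameter that trades them off. These specific loss forms are assumptions for illustration, not the formulation used in the paper.

```python
import math
from typing import List


def cross_entropy(logits: List[float], label: int) -> float:
    """Semantic constraint: softmax cross-entropy for one sample."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def contrastive(a: List[float], b: List[float], same: bool,
                margin: float = 1.0) -> float:
    """Similarity constraint: pull matched pairs together and push
    mismatched pairs at least `margin` apart."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return dist ** 2 if same else max(0.0, margin - dist) ** 2


def multi_task_loss(img_vec: List[float], txt_vec: List[float],
                    img_logits: List[float], txt_logits: List[float],
                    label: int, matched: bool, weight: float = 0.5) -> float:
    """Weighted sum of the two constraints; `weight` is a hypothetical knob."""
    semantic = cross_entropy(img_logits, label) + cross_entropy(txt_logits, label)
    similarity = contrastive(img_vec, txt_vec, matched)
    return weight * semantic + (1.0 - weight) * similarity


# Toy image-text pair with identical embeddings and uninformative logits.
loss_matched = multi_task_loss([0.0, 0.0], [0.0, 0.0],
                               [0.0, 0.0], [0.0, 0.0], label=0, matched=True)
loss_mismatched = multi_task_loss([0.0, 0.0], [0.0, 0.0],
                                  [0.0, 0.0], [0.0, 0.0], label=0, matched=False)
print(loss_matched < loss_mismatched)  # True
```

With identical embeddings, the matched pair incurs no similarity penalty while the mismatched pair is penalized by the full margin, so only the similarity term separates the two losses here.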
Abstract
Cross-modal retrieval has become a highlighted research topic for retrieval across multimedia data such as image and text. A two-stage learning framework is widely adopted by most existing methods based on Deep Neural Network (DNN): The first learning stage is to generate separate representation for each modality, and the second learning stage is to get the cross-modal common representation. However, the existing methods have three limitations: (1) In the first learning stage, they only model intra-modality correlation, but ignore inter-modality correlation with rich complementary context. (2) In the second learning stage, they only adopt shallow networks with single-loss regularization, but ignore the intrinsic relevance of intra-modality and inter-modality correlation. (3) Only original instances are considered while the complementary fine-grained clues provided by their patches are…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
