Identity-Aware Textual-Visual Matching with Latent Co-attention
Shuang Li, Tong Xiao, Hongsheng Li, Wei Yang, Xiaogang Wang

TL;DR
This paper introduces an identity-aware two-stage framework for textual-visual matching that uses identity-level annotations and a latent co-attention mechanism to improve matching accuracy.
Contribution
It proposes a novel two-stage CNN-LSTM framework with a Cross-Modal Cross-Entropy loss and latent co-attention, enhancing matching robustness and accuracy over prior methods.
Findings
Outperforms state-of-the-art on three datasets
Effectively screens easy mismatches in stage-1
Refines matching with spatial and semantic co-attention
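The screen-then-refine flow in these findings can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `cosine`, `two_stage_match`, `refine_score`, and `top_k` are hypothetical names, and cosine similarity over fixed vectors stands in for the learned stage-1 embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def two_stage_match(text_vec, image_vecs, refine_score, top_k=2):
    """Return image indices ranked best-first after screening and refinement.

    Assumption: stage 1 is a cheap embedding similarity and stage 2 is any
    more expensive scorer passed in as `refine_score` (hypothetical hook).
    """
    # Stage 1: rank all images by cheap similarity and keep a shortlist,
    # which discards the easy mismatches.
    coarse = sorted(
        range(len(image_vecs)),
        key=lambda i: cosine(text_vec, image_vecs[i]),
        reverse=True,
    )
    shortlist = coarse[:top_k]
    # Stage 2: re-rank only the shortlist with the costlier scorer.
    return sorted(
        shortlist,
        key=lambda i: refine_score(text_vec, image_vecs[i]),
        reverse=True,
    )

# Toy usage: the refine scorer here is an arbitrary stand-in.
text = [1.0, 0.0]
images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
ranked = two_stage_match(text, images, lambda t, v: v[1], top_k=2)
```

The point of the split is cost: the expensive scorer only runs on the `top_k` candidates that survive screening.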
Abstract
Textual-visual matching aims at measuring similarities between sentence descriptions and images. Most existing methods tackle this problem without effectively utilizing identity-level annotations. In this paper, we propose an identity-aware two-stage framework for the textual-visual matching problem. Our stage-1 CNN-LSTM network learns to embed cross-modal features with a novel Cross-Modal Cross-Entropy (CMCE) loss. The stage-1 network is able to efficiently screen easy incorrect matchings and also provides an initial training point for the stage-2 training. The stage-2 CNN-LSTM network refines the matching results with a latent co-attention mechanism. The spatial attention relates each word with corresponding image regions, while the latent semantic attention aligns different sentence structures to make the matching results more robust to sentence structure variations. Extensive experiments…
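As a rough illustration of how a word can be related to image regions, the sketch below computes softmax attention weights from dot-product scores. It is an assumption-laden toy: the abstract does not specify the scoring function, so plain dot products stand in for whatever learned transformation the stage-2 network uses, and `softmax` and `attend_regions` are hypothetical helper names.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend_regions(word_vec, region_vecs):
    """Attend from one word feature over a list of image-region features.

    Assumption: relevance is a dot product; a trained model would use
    learned projections instead.
    Returns (attention weights, attention-weighted region feature).
    """
    scores = [sum(w * r for w, r in zip(word_vec, region)) for region in region_vecs]
    weights = softmax(scores)
    dim = len(region_vecs[0])
    context = [
        sum(weights[i] * region_vecs[i][d] for i in range(len(region_vecs)))
        for d in range(dim)
    ]
    return weights, context

# Toy usage: the word feature aligns with the first region.
weights, context = attend_regions([1.0, 0.0], [[5.0, 0.0], [0.0, 5.0]])
```

Each word gets its own weight distribution over regions, so regions that match the word dominate the pooled visual feature used for matching.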
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
