Identity-Aware Textual-Visual Matching with Latent Co-attention

Shuang Li; Tong Xiao; Hongsheng Li; Wei Yang; Xiaogang Wang

arXiv:1708.01988 · cs.CV · August 8, 2017 · 36 cites

Open Access

TL;DR

This paper introduces an identity-aware two-stage framework for textual-visual matching that leverages identity annotations and a latent co-attention mechanism to improve matching accuracy significantly.

Contribution

It proposes a novel two-stage CNN-LSTM framework with a Cross-Modal Cross-Entropy loss and latent co-attention, enhancing matching robustness and accuracy over prior methods.

Findings

01 Outperforms state-of-the-art on three datasets

02 Effectively screens easy mismatches in stage-1

03 Refines matching with spatial and semantic co-attention
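
The screen-then-refine idea in findings 02 and 03 can be sketched as a two-pass ranking. This is a minimal illustration under stated assumptions, not the authors' implementation: cosine similarity stands in for the stage-1 embedding comparison, and `rerank_score` is a hypothetical callable standing in for the stage-2 co-attention network. The names `two_stage_rank` and `keep` are also illustrative.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def two_stage_rank(
    text_vec: Sequence[float],
    image_vecs: Sequence[Sequence[float]],
    rerank_score: Callable[[int], float],
    keep: int = 2,
) -> List[int]:
    """Return image indices ranked by a cheap first pass, then a costly second pass.

    Stage 1 (assumed here to be cosine similarity on embeddings) discards
    easy mismatches by keeping only the top `keep` candidates.
    Stage 2 calls `rerank_score` (a hypothetical stand-in for a heavier
    matching model) on the shortlist only.
    """
    coarse = sorted(
        range(len(image_vecs)),
        key=lambda i: cosine(text_vec, image_vecs[i]),
        reverse=True,
    )
    shortlist = coarse[:keep]
    return sorted(shortlist, key=rerank_score, reverse=True)
```

The expensive scorer runs on only `keep` candidates instead of the whole gallery, so a candidate rejected in the first pass is never scored by the second, whatever score it would have received.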

Abstract

Textual-visual matching aims to measure similarities between sentence descriptions and images. Most existing methods tackle this problem without effectively utilizing identity-level annotations. In this paper, we propose an identity-aware two-stage framework for the textual-visual matching problem. Our stage-1 CNN-LSTM network learns to embed cross-modal features with a novel Cross-Modal Cross-Entropy (CMCE) loss. The stage-1 network efficiently screens out easy incorrect matchings and also provides an initial training point for stage-2 training. The stage-2 CNN-LSTM network refines the matching results with a latent co-attention mechanism. The spatial attention relates each word to its corresponding image regions, while the latent semantic attention aligns different sentence structures to make the matching results more robust to sentence structure variations. Extensive experiments…
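
The abstract names the CMCE loss but does not give its formula. As a rough illustration of what a cross-entropy over cross-modal similarities can look like, the sketch below computes a generic softmax cross-entropy over the similarities between one text feature and a set of candidate identities. It is an assumption-laden stand-in, not the paper's definition: the function name and the `temperature` parameter are hypothetical.

```python
import math
from typing import Sequence


def softmax_cross_entropy(
    similarities: Sequence[float],
    target_index: int,
    temperature: float = 1.0,
) -> float:
    """Negative log-probability of the target under a softmax over similarities.

    `similarities[k]` is assumed to be the similarity between one text
    feature and the image feature of candidate identity k.
    `temperature` is a hypothetical scaling knob, not taken from the paper.
    """
    scaled = [s / temperature for s in similarities]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_sum_exp - scaled[target_index]
```

Under this generic form, the loss falls as the similarity to the correct identity rises relative to the others, which is the kind of identity-level supervision the abstract refers to.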

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques