Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

Shizhe Chen; Yida Zhao; Qin Jin; Qi Wu

arXiv:2003.00392 · cs.CV · March 3, 2020 · 26 cites

TL;DR

This paper introduces a Hierarchical Graph Reasoning model for fine-grained video-text retrieval, decomposing matching into multiple semantic levels to better capture detailed visual and textual information.

Contribution

It proposes a novel hierarchical graph reasoning approach that disentangles texts into semantic levels and guides video representation learning for improved retrieval accuracy.

Findings

01. Outperforms existing methods on three datasets.

02. Enhances the ability to distinguish fine-grained semantic differences.

03. Improves generalization across datasets.

Abstract

Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The current dominant approach to this problem is to learn a joint embedding space in which cross-modal similarities can be measured. However, simple joint embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels. Specifically, the model disentangles texts into a hierarchical semantic graph with three levels (events, actions, and entities) and the relationships across levels. Attention-based graph reasoning is utilized to generate hierarchical textual embeddings, which can guide the learning of diverse and hierarchical video…
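The global-to-local matching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each video and each text has already been encoded into one vector per level (event, action, entity), and it assumes the per-level cosine similarities are simply averaged. The function names `cosine`, `hierarchical_similarity`, and `rank_videos` are hypothetical, and the attention-based graph reasoning that would produce the embeddings is omitted entirely.

```python
import math

# The three semantic levels named in the abstract.
LEVELS = ("event", "action", "entity")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hierarchical_similarity(video_emb, text_emb):
    """Average the per-level cosine similarities.

    ASSUMPTION: one vector per level and a plain average across levels.
    The paper's actual aggregation may differ.
    """
    scores = [cosine(video_emb[level], text_emb[level]) for level in LEVELS]
    return sum(scores) / len(scores)


def rank_videos(text_emb, videos):
    """Return video ids sorted from best to worst match for one text query."""
    scored = {vid: hierarchical_similarity(emb, text_emb) for vid, emb in videos.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Toy embeddings (made up for illustration only).
query = {"event": [1.0, 0.0], "action": [0.0, 1.0], "entity": [1.0, 1.0]}
videos = {
    "b": {"event": [0.0, 1.0], "action": [1.0, 0.0], "entity": [1.0, 1.0]},
    "a": {"event": [1.0, 0.0], "action": [0.0, 1.0], "entity": [1.0, 1.0]},
}
ranking = rank_videos(query, videos)
```

In this toy setup, video "a" matches the query at every level while video "b" matches only at the entity level, so "a" ranks first. Scoring each level separately is what lets a mismatch at one level (for example, the wrong action) lower the overall score even when the other levels agree.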

Code & Models

Videos

Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization