Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu

TL;DR
This paper introduces a Hierarchical Graph Reasoning model for fine-grained video-text retrieval, decomposing matching into multiple semantic levels to better capture detailed visual and textual information.
Contribution
It proposes a novel hierarchical graph reasoning approach that disentangles texts into semantic levels and guides video representation learning for improved retrieval accuracy.
Findings
Outperforms existing methods on three datasets.
Enhances ability to distinguish fine-grained semantic differences.
Improves generalization across datasets.
Abstract
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The current dominant approach to this problem is to learn a joint embedding space in which cross-modal similarities are measured. However, simple joint embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels. Specifically, the model disentangles texts into a hierarchical semantic graph with three levels (events, actions, and entities) and the relationships across levels. Attention-based graph reasoning is utilized to generate hierarchical textual embeddings, which can guide the learning of diverse and hierarchical video…
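To make the global-to-local decomposition concrete, here is a minimal sketch of scoring a text against a video at three levels and summing the results. It is an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary keys, the max-over-frames alignment, and the unweighted sum are all hypothetical simplifications, and the attention-based graph reasoning that produces the embeddings is omitted entirely.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def local_similarity(text_nodes, video_frames):
    """Average, over text nodes, of each node's best-matching frame similarity.

    Assumption: a hard max over frames stands in for a learned alignment.
    """
    if not text_nodes or not video_frames:
        return 0.0
    best = [max(cosine(t, f) for f in video_frames) for t in text_nodes]
    return sum(best) / len(best)


def hierarchical_similarity(text, video):
    """Sum of event-level (global) and action/entity-level (local) similarities.

    Assumption: all three levels are weighted equally.
    """
    event = cosine(text["event"], video["event"])
    action = local_similarity(text["actions"], video["action_frames"])
    entity = local_similarity(text["entities"], video["entity_frames"])
    return event + action + entity


# Toy 2-D embeddings; real embeddings would come from trained encoders.
text = {
    "event": [1.0, 0.0],
    "actions": [[1.0, 0.0]],
    "entities": [[0.0, 1.0], [1.0, 0.0]],
}
video_a = {
    "event": [1.0, 0.0],
    "action_frames": [[1.0, 0.0], [0.0, 1.0]],
    "entity_frames": [[0.0, 1.0], [1.0, 0.0]],
}
video_b = {
    "event": [1.0, 0.0],
    "action_frames": [[0.0, 1.0]],
    "entity_frames": [[0.0, 1.0]],
}

score_a = hierarchical_similarity(text, video_a)
score_b = hierarchical_similarity(text, video_b)
```

In this toy case both videos match the text equally well at the event level, so a single global embedding could not separate them; the action and entity levels are what rank `video_a` above `video_b`.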
Videos
Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning (YouTube)
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
