# Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings

**Authors:** Michael Wray, Diane Larlus, Gabriela Csurka, Dima Damen

arXiv: 1908.03477 · 2019-08-12

## TL;DR

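This paper addresses cross-modal fine-grained action retrieval between text and video by disentangling the parts-of-speech (PoS) in captions. It learns a separate multi-modal embedding space for each PoS tag and combines their outputs in an integrated space, which improves retrieval on the large-scale EPIC dataset and on MSR-VTT.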

## Contribution

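The paper proposes a multi-space embedding method that builds a separate multi-modal embedding space for each PoS tag and feeds their outputs into an integrated multi-modal space, with all embeddings trained jointly through PoS-aware and PoS-agnostic losses. It reports the first fine-grained action retrieval results on the large-scale EPIC dataset and demonstrates the benefit of the approach on MSR-VTT.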

## Key findings

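- On the large-scale EPIC dataset, in a generalised zero-shot setting, the approach shows an advantage for both video-to-text and text-to-video action retrieval.
- Disentangling PoS also benefits the generic task of cross-modal video retrieval on the MSR-VTT dataset.
- The specialised per-PoS embedding spaces offer multiple views of the same embedded entities, which supports fine-grained action retrieval.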

## Abstract

We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved through learning a shared embedding space, that can indifferently embed modalities. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval. All embeddings are trained jointly through a combination of PoS-aware and PoS-agnostic losses. Our proposal enables learning specialised embedding spaces that offer multiple views of the same embedded entities. We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting. Results show the advantage of our approach for both video-to-text and text-to-video action retrieval. We also demonstrate the benefit of disentangling the PoS for the generic task of cross-modal video retrieval on the MSR-VTT dataset.
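The structure described in the abstract (one embedding space per PoS tag, an integrated space built on their outputs, and a joint PoS-aware plus PoS-agnostic objective) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the tag names `verb` and `noun`, the plain linear projections, the triplet loss with squared Euclidean distance, the unweighted sum of the two loss terms, and the names `PosEmbedder` and `joint_loss`.

```python
import random

def init(rows, cols, rng):
    """Random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(weights, x):
    """Apply a linear map (list of rows) to vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: pull the positive closer than the negative."""
    return max(0.0, margin + sq_dist(anchor, positive) - sq_dist(anchor, negative))

class PosEmbedder:
    """Hypothetical encoder for one modality: one projection per PoS tag,
    then a final projection of the concatenated per-PoS outputs."""

    def __init__(self, in_dim, pos_tags, pos_dim, out_dim, rng):
        self.pos_tags = list(pos_tags)
        self.pos_proj = {t: init(pos_dim, in_dim, rng) for t in self.pos_tags}
        self.final_proj = init(out_dim, pos_dim * len(self.pos_tags), rng)

    def embed(self, features):
        # features maps each PoS tag to an input vector for that tag.
        per_pos = {t: linear(self.pos_proj[t], features[t]) for t in self.pos_tags}
        concat = [v for t in self.pos_tags for v in per_pos[t]]
        return per_pos, linear(self.final_proj, concat)

def joint_loss(video, text_pos, text_neg, tags):
    """Sum of per-PoS triplet terms (PoS-aware) and one triplet term in the
    integrated space (PoS-agnostic). Equal weighting is an assumption."""
    v_pos, v_final = video
    p_pos, p_final = text_pos
    n_pos, n_final = text_neg
    pos_aware = sum(triplet(v_pos[t], p_pos[t], n_pos[t]) for t in tags)
    pos_agnostic = triplet(v_final, p_final, n_final)
    return pos_aware + pos_agnostic

rng = random.Random(0)
tags = ["verb", "noun"]  # assumed tag set, for illustration only
video_model = PosEmbedder(in_dim=8, pos_tags=tags, pos_dim=3, out_dim=4, rng=rng)
text_model = PosEmbedder(in_dim=5, pos_tags=tags, pos_dim=3, out_dim=4, rng=rng)

clip = [rng.random() for _ in range(8)]
video_feats = {t: clip for t in tags}  # same clip features feed every PoS branch
caption_match = {t: [rng.random() for _ in range(5)] for t in tags}
caption_other = {t: [rng.random() for _ in range(5)] for t in tags}

loss = joint_loss(
    video_model.embed(video_feats),
    text_model.embed(caption_match),
    text_model.embed(caption_other),
    tags,
)
```

In this sketch the video is the anchor and captions supply the positive and negative, which corresponds to one retrieval direction; a text-anchored term could be added symmetrically. At retrieval time, under the same assumptions, candidates would be ranked by distance between the second element returned by `embed` for the query and for each candidate.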

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1908.03477/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1908.03477/full.md

## References

42 references — full list in the complete paper: https://tomesphere.com/paper/1908.03477/full.md

---
Source: https://tomesphere.com/paper/1908.03477