Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos

Shuo Yang; Xinxiao Wu

arXiv:2205.05854 · cs.CV · May 13, 2022 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces entity-aware and motion-aware Transformers for language-driven action localization in videos: entity queries first coarsely locate the relevant clips, then motion queries predict precise boundaries within the narrowed temporal region.

Contribution

The paper presents a novel Transformer-based framework that integrates entity and motion information for more accurate action localization in videos.

Findings

01

Outperforms existing methods on Charades-STA and TACoS datasets.

02

Effectively combines entity and motion cues for better boundary prediction.

03

Enhances visual-linguistic alignment with cross-modal and cross-frame attention.

Abstract

Language-driven action localization in videos is a challenging task that involves not only visual-linguistic matching but also action boundary prediction. Recent progress has been achieved through aligning language queries to video segments, but estimating precise boundaries is still under-explored. In this paper, we propose entity-aware and motion-aware Transformers that progressively localize actions in videos by first coarsely locating clips with entity queries and then finely predicting exact boundaries in a shrunken temporal region with motion queries. The entity-aware Transformer incorporates the textual entities into visual representation learning via cross-modal and cross-frame attentions to facilitate attending to action-related video clips. The motion-aware Transformer captures fine-grained motion changes at multiple temporal scales via integrating long short-term memory into the…
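
The coarse-to-fine idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the two Transformers are replaced by hand-written per-clip score lists, and the names `coarse_region`, `fine_boundaries`, `localize`, and the `threshold` parameter are hypothetical, introduced only for illustration. It shows how restricting boundary prediction to a shrunken region can rule out high-scoring but irrelevant boundaries elsewhere in the video.

```python
from typing import List, Tuple


def coarse_region(clip_scores: List[float], threshold: float) -> Tuple[int, int]:
    """Coarse stage (assumed): span from the first to the last clip whose
    relevance score reaches the threshold."""
    hits = [i for i, s in enumerate(clip_scores) if s >= threshold]
    if not hits:
        # Fall back to the single best-scoring clip.
        best = max(range(len(clip_scores)), key=clip_scores.__getitem__)
        return best, best
    return hits[0], hits[-1]


def fine_boundaries(
    start_scores: List[float],
    end_scores: List[float],
    region: Tuple[int, int],
) -> Tuple[int, int]:
    """Fine stage (assumed): best (start, end) pair with start <= end,
    searched only inside the coarse region."""
    lo, hi = region
    best_pair, best_score = (lo, lo), float("-inf")
    for s in range(lo, hi + 1):
        for e in range(s, hi + 1):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best_pair, best_score = (s, e), score
    return best_pair


def localize(
    clip_scores: List[float],
    start_scores: List[float],
    end_scores: List[float],
    threshold: float = 0.5,
) -> Tuple[int, int]:
    """Run the coarse stage, then the fine stage inside its region."""
    region = coarse_region(clip_scores, threshold)
    return fine_boundaries(start_scores, end_scores, region)


# Made-up scores for a six-clip video (illustrative values only).
clip_scores = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3]
start_scores = [0.9, 0.1, 0.2, 0.8, 0.1, 0.0]
end_scores = [0.0, 0.1, 0.1, 0.2, 0.9, 0.95]

print(localize(clip_scores, start_scores, end_scores))  # (3, 4)
```

In this toy input the highest start score is at clip 0 and the highest end score is at clip 5, but both fall outside the coarse region (clips 2 to 4), so the fine stage selects clips 3 to 4 instead.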

Code & Models

Repositories

shuoyang129/eamat (PyTorch · Official)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Byte Pair Encoding · Absolute Position Encodings · Residual Connection · Dense Connections · Label Smoothing · Dropout