VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation

Congqi Cao; Ze Sun; Qinyi Lv; Lingtong Min; Yanning Zhang

arXiv:2307.03918·cs.CV·July 11, 2023·1 cites

Open Access

TL;DR

This paper introduces VS-TransGRU, a novel framework combining visual-semantic fusion with Transformer and GRU architectures to improve egocentric action anticipation, achieving state-of-the-art results on large-scale datasets.

Contribution

It is the first to incorporate high-level semantic features and a fusion module into a Transformer-GRU framework for egocentric action anticipation.

Findings

01. Achieves new state-of-the-art performance on EPIC-Kitchens and EGTEA Gaze+ datasets.

02. Outperforms previous methods by a large margin.

03. Validates the effectiveness of semantic augmentation and fusion in action prediction.

Abstract

Egocentric action anticipation is a challenging task that aims to predict future actions in advance from current and historical observations in the first-person view. Most existing methods try to boost anticipation performance by improving the model architecture and loss function while relying on visual input and a recurrent neural network. However, these methods, which consider only visual information and rely on a single network architecture, are gradually reaching a performance plateau. To fully understand what has been observed and to capture the dependencies between current observations and future actions, we propose a novel Transformer-GRU-based action anticipation framework enhanced by visual-semantic fusion. First, high-level semantic information is introduced to improve action anticipation performance for the first time. We propose…

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Dropout