VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation
Congqi Cao, Ze Sun, Qinyi Lv, Lingtong Min, Yanning Zhang

TL;DR
This paper introduces VS-TransGRU, a novel framework combining visual-semantic fusion with Transformer and GRU architectures to improve egocentric action anticipation, achieving state-of-the-art results on large-scale datasets.
Contribution
It is the first to incorporate high-level semantic features and a fusion module into a Transformer-GRU framework for egocentric action anticipation.
Findings
Achieves new state-of-the-art performance on EPIC-Kitchens and EGTEA Gaze+ datasets.
Outperforms previous methods by a large margin.
Validates the effectiveness of semantic augmentation and fusion in action prediction.
Abstract
Egocentric action anticipation is a challenging task that aims to predict future actions in advance from current and historical observations in the first-person view. Most existing methods focus on improving the model architecture and loss function, relying on visual input and recurrent neural networks to boost anticipation performance. However, these methods, which consider only visual information and rely on a single network architecture, gradually reach a performance plateau. To fully understand what has been observed and to capture the dependencies between current observations and future actions, we propose a novel visual-semantic fusion enhanced, Transformer-GRU-based action anticipation framework. First, high-level semantic information is introduced to improve action anticipation performance for the first time. We propose…
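The abstract is truncated before it describes the fusion module, so the following is only a minimal sketch of the general idea of combining a visual feature vector with a semantic feature vector before predicting an action. It assumes concatenation followed by a linear layer and a softmax, which may differ from the paper's actual design. All names (`fuse_and_score`, `visual`, `semantic`) and all sizes are hypothetical, the weights are random rather than learned, and the Transformer and GRU components are not modelled here.

```python
import math
import random


def linear(x, weight, bias):
    """Compute weight @ x + bias using plain Python lists."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_score(visual, semantic, weight, bias):
    """Concatenate both feature vectors, project them, and normalise.

    Concatenation is an assumed fusion choice for illustration only.
    """
    fused = visual + semantic  # list concatenation, length 4 + 3 = 7 below
    return softmax(linear(fused, weight, bias))


# Hypothetical sizes: 4-d visual feature, 3-d semantic feature, 5 actions.
rng = random.Random(0)
visual = [rng.uniform(-1, 1) for _ in range(4)]
semantic = [rng.uniform(-1, 1) for _ in range(3)]
weight = [[rng.uniform(-0.5, 0.5) for _ in range(7)] for _ in range(5)]
bias = [0.0] * 5

probs = fuse_and_score(visual, semantic, weight, bias)
# Index of the most probable (hypothetical) future action class.
predicted_action = max(range(len(probs)), key=probs.__getitem__)
```

In a full model, the fused vector would typically be computed per observed time step and passed to a sequence model rather than classified directly.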
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Dropout
