Region-aware Image-based Human Action Retrieval with Transformers
Hongsong Wang, Jianhua Zhao, Jie Gui

TL;DR
This paper introduces a novel transformer-based model for image-based human action retrieval, leveraging multi-region features to improve retrieval accuracy and establishing new benchmarks for the task.
Contribution
Proposes a new end-to-end model whose fusion transformer combines person, context, and global features into a single representation for action retrieval.
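As a rough illustration of what fusing person, context, and global features can look like, the sketch below treats each region feature as a token and applies one round of single-head self-attention, then pools the result into one embedding. Everything here is an assumption made for illustration: the function name `fuse_regions`, the random projection weights, the single head, the residual connection, and the mean pooling are not taken from the paper, whose actual fusion transformer may differ substantially.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_regions(person, context, global_feat, w_q, w_k, w_v):
    """Fuse region features with one self-attention step (illustrative only).

    person:      (d,)   feature of the anchored person
    context:     (n, d) features of n contextual regions
    global_feat: (d,)   feature of the whole image
    """
    # Stack all region features as a token sequence of shape (n + 2, d).
    tokens = np.vstack([person[None, :], context, global_feat[None, :]])
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Scaled dot-product attention: every token attends to every other token.
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    attended = tokens + attn @ v  # residual connection (assumed)
    fused = attended.mean(axis=0)  # mean pooling (assumed)
    return fused / np.linalg.norm(fused), attn


# Random stand-ins for backbone features and learned weights (hypothetical).
rng = np.random.default_rng(0)
d = 8
person = rng.normal(size=d)
context = rng.normal(size=(2, d))
global_feat = rng.normal(size=d)
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

embedding, attn = fuse_regions(person, context, global_feat, w_q, w_k, w_v)
```

The returned attention matrix has one row per token, so it can be inspected to see how strongly the person token draws on context versus the global image in this toy setup.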
Findings
Significantly outperforms previous methods on the Stanford-40 and PASCAL VOC 2012 Action datasets.
Establishes benchmarks and baseline methods for image-based action retrieval.
Demonstrates the effectiveness of multi-region feature fusion in action understanding.
Abstract
Human action understanding is a fundamental and challenging task in computer vision. Although there is extensive research in this area, most works focus on action recognition, while action retrieval has received less attention. In this paper, we focus on the neglected but important task of image-based action retrieval, which aims to find images that depict the same action as a query image. We establish benchmarks for this task and set up important baseline methods for fair comparison. We present an end-to-end model that learns rich action representations from three aspects: the anchored person, contextual regions, and the global image. A novel fusion transformer module is designed to model the relationships among different features and effectively fuse them into an action representation. Experiments on the Stanford-40 and PASCAL VOC 2012 Action datasets show that the proposed method…
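The retrieval task described in the abstract, finding gallery images that depict the same action as a query image, can be sketched as a nearest-neighbor search over action embeddings. The example below is a minimal sketch under stated assumptions: cosine similarity for ranking and average precision for scoring are common choices in retrieval work, not details confirmed by the text above, and the two-dimensional embeddings and action labels are hypothetical toy values.

```python
import numpy as np


def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims)


def average_precision(ranked_labels, query_label):
    """Average precision of one ranked list; relevant means same action label."""
    hits = 0
    precisions = []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy gallery of four embeddings with hypothetical action labels.
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["riding_bike", "riding_bike", "cooking", "cooking"]
query = np.array([1.0, 0.2])  # hypothetical embedding of a "riding_bike" query

order = rank_gallery(query, gallery)
ranked_labels = [labels[i] for i in order]
ap = average_precision(ranked_labels, "riding_bike")
```

Averaging `average_precision` over many queries gives mean average precision, one way such a benchmark could be scored.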
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems
Methods: Sparse Evolutionary Training · Focus
