Region-aware Image-based Human Action Retrieval with Transformers

Hongsong Wang; Jianhua Zhao; Jie Gui

arXiv:2407.09924·cs.CV·July 30, 2024

TL;DR

This paper introduces a novel transformer-based model for image-based human action retrieval, leveraging multi-region features to improve retrieval accuracy and establishing new benchmarks for the task.

Contribution

The paper proposes an end-to-end model whose fusion transformer combines person, context, and global image features into a single action representation for retrieval.

Findings

01. Significantly outperforms previous methods on the Stanford-40 and PASCAL VOC 2012 Action datasets.

02. Establishes benchmarks and baseline methods for image-based action retrieval.

03. Demonstrates the effectiveness of multi-region feature fusion in action understanding.

Abstract

Human action understanding is a fundamental and challenging task in computer vision. Although a tremendous amount of research exists in this area, most works focus on action recognition, while action retrieval has received less attention. In this paper, we focus on the neglected but important task of image-based action retrieval, which aims to find images that depict the same action as a query image. We establish benchmarks for this task and set up important baseline methods for fair comparison. We present an end-to-end model that learns rich action representations from three aspects: the anchored person, contextual regions, and the global image. A novel fusion transformer module is designed to model the relationships among different features and effectively fuse them into an action representation. Experiments on the Stanford-40 and PASCAL VOC 2012 Action datasets show that the proposed method…
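
The idea of fusing person, context, and global features and then retrieving by similarity can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes a single attention head with identity query/key/value projections (a trained model would learn these), mean pooling over the three tokens, and cosine-similarity ranking. The function names `fuse_regions` and `retrieve`, the 4-dimensional toy features, and the gallery labels are all hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse_regions(person, context, global_feat):
    """Toy self-attention over three region tokens, followed by mean pooling.

    Assumption: identity Q/K/V projections and one head, purely for illustration.
    """
    tokens = [person, context, global_feat]
    dim = len(person)
    scale = math.sqrt(dim)
    attended = []
    for query in tokens:
        # Scaled dot-product scores of this token against all three tokens.
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        # Attention output: weighted sum of the token vectors.
        attended.append(
            [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
        )
    pooled = [sum(tok[i] for tok in attended) / len(attended) for i in range(dim)]
    return l2_normalize(pooled)


def retrieve(query, gallery, top_k=2):
    """Rank gallery embeddings by cosine similarity to the query embedding."""
    scored = [
        (name, sum(a * b for a, b in zip(query, emb))) for name, emb in gallery.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy features: (person, context, global) per image.
gallery = {
    "riding_bike_1": fuse_regions([1.0, 0.0, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0]),
    "reading_1": fuse_regions([0.0, 1.0, 0.0, 0.0], [0.0, 0.8, 0.2, 0.0], [0.0, 0.9, 0.1, 0.0]),
    "cooking_1": fuse_regions([0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.8, 0.2], [0.0, 0.0, 0.9, 0.1]),
}
query = fuse_regions([0.9, 0.1, 0.0, 0.0], [0.7, 0.3, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
ranking = retrieve(query, gallery, top_k=2)
print(ranking)
```

In this toy setup the query's features lie mostly along the same dimension as the first gallery image, so that image ranks first. A real system would extract the three feature sets with a detector and a backbone network and train the fusion module end to end.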

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems

Methods: Sparse Evolutionary Training · Focus