A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition

Peiqin Zhuang; Lei Bai; Yichao Wu; Ding Liang; Luping Zhou; Yali Wang; Wanli Ouyang

arXiv:2510.18705 · cs.CV · October 24, 2025

Open Access

TL;DR

This paper introduces EMIM, a module that enhances transformer-based action recognition by explicitly modeling motion through affinity matrices, significantly improving performance on motion-sensitive datasets.

Contribution

The paper proposes EMIM, a module that integrates explicit motion information mining into transformers by building the self-attention affinity matrix in a cost volume style, targeting the weak performance of transformers on motion-sensitive datasets.

Findings

01. Outperforms state-of-the-art methods on multiple datasets

02. Significantly improves results on motion-sensitive datasets

03. Validates the effectiveness of explicit motion modeling

Abstract

Recently, action recognition has been dominated by transformer-based methods, thanks to their capacity for spatiotemporal contextual aggregation. However, despite significant progress on scene-related datasets, they perform poorly on motion-sensitive datasets because they lack dedicated motion modeling designs. Meanwhile, we observe that the cost volume widely used in traditional action recognition is highly similar to the affinity matrix defined in self-attention, yet offers powerful motion modeling capacity. In light of this, we propose the Explicit Motion Information Mining module (EMIM), which integrates these motion modeling properties into existing transformers in a unified and neat way. In EMIM, we construct the affinity matrix in a cost volume style, where the set of key candidate tokens is sampled from…
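
To make the cost volume and affinity matrix analogy concrete, here is a minimal sketch in plain Python. It is not the authors' EMIM implementation. The abstract is truncated before it states where key candidates are sampled from, so the sketch assumes a classic cost volume setup: each query token in frame t is compared against key tokens in a small spatial window of the next frame. The names `local_affinity`, `expected_displacement`, `make_frame`, and the `radius` parameter are all hypothetical.

```python
import math


def local_affinity(feat_t, feat_next, radius=1):
    """Softmax affinities between each query token in frame t and the key
    tokens inside a (2*radius+1)^2 window of the next frame.

    ASSUMPTION: the local-window sampling is illustrative only; it is not
    taken from the paper. Frames are nested lists indexed [y][x][channel].
    """
    height, width = len(feat_t), len(feat_t[0])
    scale = 1.0 / math.sqrt(len(feat_t[0][0]))  # same scaling as dot-product attention
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            q = feat_t[y][x]
            offsets, logits = [], []
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ky, kx = y + dy, x + dx
                    if 0 <= ky < height and 0 <= kx < width:  # skip out-of-frame keys
                        k = feat_next[ky][kx]
                        offsets.append((dy, dx))
                        logits.append(scale * sum(a * b for a, b in zip(q, k)))
            peak = max(logits)  # subtract the max for a numerically stable softmax
            exps = [math.exp(v - peak) for v in logits]
            total = sum(exps)
            row.append({o: e / total for o, e in zip(offsets, exps)})
        out.append(row)
    return out


def expected_displacement(affinity):
    """Affinity-weighted mean offset (dy, dx) per query token (a soft-argmax)."""
    return [
        [
            (
                sum(p * off[0] for off, p in cell.items()),
                sum(p * off[1] for off, p in cell.items()),
            )
            for cell in row
        ]
        for row in affinity
    ]


def make_frame(pos, size=3):
    """Toy frame: one distinctive token at `pos`, zeros elsewhere."""
    return [
        [[10.0, 0.0] if (y, x) == pos else [0.0, 0.0] for x in range(size)]
        for y in range(size)
    ]


# A token that moves one step to the right between two frames.
affinity = local_affinity(make_frame((1, 1)), make_frame((1, 2)), radius=1)
print(expected_displacement(affinity)[1][1])
```

In this toy case the query at the center matches almost exclusively the key one step to its right, so its expected displacement should come out close to (0, 1). The point of the sketch is that the per-query affinity row is computed exactly like a restricted self-attention row, while its support over displacements is what lets it carry motion information.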

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Context-Aware Activity Recognition Systems