Query matching for spatio-temporal action detection with query-based object detector
Shimon Hori, Kazuki Omi, Toru Tamaki

TL;DR
This paper extends the DETR object detection model to spatio-temporal action detection in videos by introducing query matching across frames to maintain temporal consistency, resulting in improved performance on the JHMDB21 dataset.
Contribution
It introduces a novel query matching method across frames for DETR-based spatio-temporal detection, addressing temporal consistency issues in video analysis.
Findings
Significant performance improvement on the JHMDB21 dataset
Effective query matching enhances temporal consistency
Method outperforms baseline DETR in video action detection
Abstract
In this paper, we propose a method that extends the query-based object detection model, DETR, to spatio-temporal action detection, which requires maintaining temporal consistency in videos. Our proposed method applies DETR to each frame and uses feature shift to incorporate temporal information. However, DETR's object queries in each frame may correspond to different objects, making a simple feature shift ineffective. To overcome this issue, we propose query matching across different frames, ensuring that queries for the same object are matched and used for the feature shift. Experimental results show that performance on the JHMDB21 dataset improves significantly when query features are shifted using the proposed query matching.
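The idea of matching queries across frames before shifting features can be illustrated with a minimal sketch. The abstract does not specify the matching criterion or the exact form of the shift, so everything below is an assumption made for illustration: queries are matched greedily by cosine similarity, and the "shift" copies the first `n_shift` channels from the matched previous-frame query. The function names `cosine`, `match_queries`, and `shift_features` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def match_queries(cur, prev):
    """Greedy one-to-one matching (an assumed criterion, not the paper's).

    Returns a list where entry i is the index of the previous-frame query
    matched to current-frame query i, or None if no partner was left.
    """
    # Score every (current, previous) pair and visit the best-scoring pairs first.
    pairs = sorted(
        ((cosine(c, p), i, j) for i, c in enumerate(cur) for j, p in enumerate(prev)),
        reverse=True,
    )
    match = [None] * len(cur)
    used = set()
    for _, i, j in pairs:
        if match[i] is None and j not in used:
            match[i] = j
            used.add(j)
    return match


def shift_features(cur, prev, match, n_shift):
    """Replace the first n_shift channels of each current query with the
    channels of its matched previous-frame query (assumed form of the shift)."""
    out = []
    for i, q in enumerate(cur):
        j = match[i]
        if j is None:
            out.append(list(q))  # unmatched query is left unchanged
        else:
            out.append(list(prev[j][:n_shift]) + list(q[n_shift:]))
    return out


# Two queries per frame; the query order is swapped between frames,
# which is the situation where an index-wise shift would mix up objects.
prev_frame = [[1, 0, 0, 0], [0, 1, 0, 0]]
cur_frame = [[0, 0.9, 0.1, 0], [0.9, 0, 0, 0.1]]

match = match_queries(cur_frame, prev_frame)
shifted = shift_features(cur_frame, prev_frame, match, n_shift=2)
```

In this toy case the matching pairs current query 0 with previous query 1 and vice versa, so each query receives channels from the query that resembles it rather than from the query that happens to share its index. A real implementation would more likely operate on batched tensors and could use an optimal assignment instead of the greedy pass shown here.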
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Time Series Analysis and Forecasting
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Layer Normalization · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings
