The Solution for Temporal Action Localisation Task of Perception Test Challenge 2024
Yinan Han, Qingyuan Jiang, Hongming Mei, Yang Yang, Jinhui Tang

TL;DR
This paper introduces a multimodal approach to Temporal Action Localisation in videos that combines state-of-the-art video and audio feature extractors with data augmentation, taking first place in the Perception Test Challenge 2024.
Contribution
The paper presents a novel combination of multimodal feature extraction, data augmentation, and prediction fusion strategies for improved temporal action localisation.
Findings
Achieved a score of 0.5498, securing first place.
Enhanced generalization through data augmentation with overlapping labels.
Effective fusion of video and audio predictions improves localisation accuracy.
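The overlapping-label augmentation mentioned in the findings can be pictured with a minimal sketch. The summary does not say which classes overlap or how samples are stored, so the dictionary layout, the function name `augment_with_overlap`, and every label string below are hypothetical placeholders, not the authors' mapping.

```python
from typing import Dict, List


def augment_with_overlap(
    target_samples: List[dict],
    external_samples: List[dict],
    label_map: Dict[str, str],
) -> List[dict]:
    """Append external samples whose label maps onto a target class.

    label_map goes from an external-dataset label to a target-dataset label.
    External samples with no mapped label are dropped.
    """
    extra = [
        {**sample, "label": label_map[sample["label"]], "source": "external"}
        for sample in external_samples
        if sample["label"] in label_map
    ]
    return list(target_samples) + extra


# Hypothetical placeholder data: real class names are not given in the summary.
target = [{"video": "t1.mp4", "label": "target_class_a"}]
external = [
    {"video": "e1.webm", "label": "external_class_1"},
    {"video": "e2.webm", "label": "external_class_2"},
]
mapping = {"external_class_1": "target_class_a"}
train_set = augment_with_overlap(target, external, mapping)
```

Only the external clip whose class has a counterpart in the target label set is kept, and it is relabelled with the target class name so both sources share one label space.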
Abstract
This report presents our method for Temporal Action Localisation (TAL), which focuses on identifying and classifying actions within specific time intervals throughout a video sequence. We employ a data augmentation technique by expanding the training dataset using overlapping labels from the Something-SomethingV2 dataset, enhancing the model's ability to generalize across various action classes. For feature extraction, we utilize state-of-the-art models: UMT and VideoMAEv2 for video features, and BEATs and CAV-MAE for audio features. Our approach involves training both multimodal (video and audio) and unimodal (video only) models, followed by combining their predictions using the Weighted Box Fusion (WBF) method. This fusion strategy ensures robust action localisation. Our overall approach achieves a score of 0.5498, securing first place in the competition.
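To make the fusion step concrete, here is a minimal sketch of box fusion adapted to one-dimensional temporal segments. It is a simplified illustration, not the authors' implementation: the abstract gives no model weights or IoU threshold, so `weights` and `iou_thr` are assumed values, and the names `Segment`, `temporal_iou`, and `fuse_segments` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    score: float  # confidence, assumed > 0
    label: str


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def _merge(cluster: List[Segment]) -> Segment:
    """Score-weighted average of boundaries; mean score of the cluster."""
    total = sum(s.score for s in cluster)
    start = sum(s.start * s.score for s in cluster) / total
    end = sum(s.end * s.score for s in cluster) / total
    return Segment(start, end, total / len(cluster), cluster[0].label)


def fuse_segments(
    model_preds: Sequence[Sequence[Segment]],
    weights: Sequence[float],
    iou_thr: float = 0.5,  # assumed threshold, not taken from the report
) -> List[Segment]:
    # Pool all predictions, scaling each score by its model's weight.
    pooled = [
        Segment(s.start, s.end, s.score * w, s.label)
        for preds, w in zip(model_preds, weights)
        for s in preds
        if s.score * w > 0
    ]
    pooled.sort(key=lambda s: s.score, reverse=True)

    clusters: List[List[Segment]] = []
    fused: List[Segment] = []
    for seg in pooled:
        # Find the same-label fused segment with the highest IoU above the threshold.
        best_i, best_iou = -1, iou_thr
        for i, f in enumerate(fused):
            if f.label == seg.label:
                iou = temporal_iou(f, seg)
                if iou > best_iou:
                    best_i, best_iou = i, iou
        if best_i < 0:
            clusters.append([seg])
            fused.append(seg)
        else:
            clusters[best_i].append(seg)
            fused[best_i] = _merge(clusters[best_i])

    # Down-weight segments that only some of the models agreed on.
    total_w = sum(weights)
    out = [
        Segment(f.start, f.end, f.score * min(len(c), total_w) / total_w, f.label)
        for f, c in zip(fused, clusters)
    ]
    return sorted(out, key=lambda s: s.score, reverse=True)
```

Unlike non-maximum suppression, which discards all but the top-scoring segment in an overlapping group, this scheme averages the boundaries of every matched segment, so a multimodal and a video-only model that roughly agree each contribute to the final start and end times.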
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Advanced Decision-Making Techniques · Visual Attention and Saliency Detection
