Group Relative Augmentation for Data Efficient Action Detection
Deep Anil Patel, Iain Melvin, Zachary Izzo, Martin Renqiang Min

TL;DR
This paper introduces a data-efficient method for action detection that combines parameter-efficient tuning, learnable feature augmentation, and a group-weighted loss to improve performance with limited data on complex video datasets.
Contribution
It proposes an adaptation strategy for large Video-Language Models (VLMs) that pairs parameter-efficient tuning (LoRA) with learnable, FiLM-based internal feature augmentation and a group-weighted loss, improving data efficiency and robustness.
Findings
Achieves high mAP on the AVA and MOMA datasets.
Demonstrates significant data efficiency when only a few training examples are available.
Outperforms existing methods on multi-label, multi-person action detection.
Abstract
Adapting large Video-Language Models (VLMs) for action detection using only a few examples poses challenges like overfitting and the granularity mismatch between scene-level pre-training and required person-centric understanding. We propose an efficient adaptation strategy combining parameter-efficient tuning (LoRA) with a novel learnable internal feature augmentation. Applied within the frozen VLM backbone using FiLM, these augmentations generate diverse feature variations directly relevant to the task. Additionally, we introduce a group-weighted loss function that dynamically modulates the training contribution of each augmented sample based on its prediction divergence relative to the group average. This promotes robust learning by prioritizing informative yet reasonable augmentations. We demonstrate our method's effectiveness on complex multi-label, multi-person action detection…
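The two mechanisms named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the divergence measure (mean absolute difference from the group-average prediction) and the weighting rule (a bump that peaks at the group's average divergence) are assumptions chosen only to mirror the stated idea of favoring "informative yet reasonable" augmentations. The function names `film`, `group_weights`, and `group_weighted_loss` are hypothetical.

```python
import math


def film(features, gamma, beta):
    """FiLM: per-channel affine modulation, gamma * h + beta.

    In the paper gamma and beta would be learnable; here they are plain lists.
    """
    return [g * h + b for h, g, b in zip(features, gamma, beta)]


def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over the label set (multi-label setting)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def group_weights(group_probs):
    """Weight each augmented sample by its divergence from the group mean.

    ASSUMPTION: divergence is the mean absolute difference to the group-average
    prediction, and the weight r * exp(1 - r), with r = divergence / average
    divergence, peaks at r = 1. Near-duplicates (r close to 0) and outliers
    (r much larger than 1) both receive less weight.
    """
    n = len(group_probs)
    k = len(group_probs[0])
    mean = [sum(p[j] for p in group_probs) / n for j in range(k)]
    div = [sum(abs(p[j] - mean[j]) for j in range(k)) / k for p in group_probs]
    avg = sum(div) / n
    if avg == 0.0:
        # All augmentations agree exactly: fall back to uniform weights.
        return [1.0 / n] * n
    raw = [(d / avg) * math.exp(1.0 - d / avg) for d in div]
    total = sum(raw)
    return [r / total for r in raw]


def group_weighted_loss(group_probs, labels):
    """Weighted sum of per-augmentation losses for one training example."""
    weights = group_weights(group_probs)
    return sum(w * bce(p, labels) for w, p in zip(weights, group_probs))


# Three augmented views of one person box, two action labels.
group = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
labels = [1, 0]
weights = group_weights(group)
loss = group_weighted_loss(group, labels)
```

In this toy group the third view diverges most from the group average and the second least, so under the assumed rule the first view, whose divergence is closest to the group's average, receives the largest weight. In an actual training loop the weights would typically be treated as constants (no gradient) when computing the loss, which is also an assumption here.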
