ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos

Reza Ghoddoosian; Nakul Agarwal; Isht Dwivedi; Behzad Dariush

arXiv:2411.15628·cs.CV·November 26, 2024

TL;DR

This paper introduces ACE, a fine-tuning method that enhances video-language models' understanding of procedural actions by incorporating augmented synonyms and negatives, leading to better zero-shot classification in cooking and assembly videos.

Contribution

The paper proposes a novel fine-tuning technique, ACE, that improves VLMs' robustness and understanding of procedural actions through label augmentation and stochastic replacement.

Findings

01 Significant improvements in zero-shot action classification accuracy.

02 Enhanced embedding alignment of unseen action synonyms.

03 Maintained competitive performance on seen actions.
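
The findings above concern zero-shot classification and the alignment of synonym embeddings. As a rough illustration only, the sketch below assumes a CLIP-style setup in which a video embedding is compared with text embeddings of candidate action labels by cosine similarity; the summary does not spell out this mechanism. The function names and the toy two-dimensional vectors are hypothetical stand-ins for real encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(video_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the video embedding.

    label_embeddings maps a label string to its (hypothetical) text embedding.
    """
    return max(
        label_embeddings,
        key=lambda name: cosine(video_embedding, label_embeddings[name]),
    )


# Toy vectors standing in for encoder outputs (assumed values, not real data).
toy_labels = {
    "pour milk": [0.9, 0.1],
    "crack egg": [0.1, 0.9],
}
toy_video = [1.0, 0.0]
predicted = zero_shot_classify(toy_video, toy_labels)
```

In this picture, a model that is invariant to synonyms would place the text embeddings of two phrasings of the same action close together, so either phrasing would yield the same prediction.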

Abstract

Vision-language models (VLMs) are capable of recognizing unseen actions. However, existing VLMs lack intrinsic understanding of procedural action concepts. Hence, they overfit to fixed labels and are not invariant to unseen action synonyms. To address this, we propose a simple fine-tuning technique, Action Concept Enhancement (ACE), to improve the robustness and concept understanding of VLMs in procedural action classification. ACE continually incorporates augmented action synonyms and negatives in an auxiliary classification loss by stochastically replacing fixed labels during training. This creates new combinations of action labels over the course of fine-tuning and prevents overfitting to fixed action representations. We show the enhanced concept understanding of our VLM by visualizing the alignment of encoded embeddings of unseen action synonyms in the embedding space. Our…
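
The stochastic label replacement described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the function `build_candidate_labels`, the `p_replace` parameter, and the way a negative is appended as an extra wrong class are all assumptions made for illustration. The sketch only shows how swapping fixed labels for synonyms at random produces a different label set at each training step.

```python
import random


def build_candidate_labels(labels, synonyms, negatives, target,
                           p_replace=0.5, rng=None):
    """Build the candidate label texts for one training step (hypothetical).

    labels:    fixed class labels of the dataset
    synonyms:  dict mapping a fixed label to a list of augmented synonyms
    negatives: dict mapping a fixed label to a list of augmented negatives
    target:    ground-truth fixed label of the current clip
    Returns (candidate_texts, index_of_correct_candidate).
    """
    if rng is None:
        rng = random.Random()
    candidates = []
    for label in labels:
        text = label
        # Stochastically replace the fixed label with one of its synonyms.
        if synonyms.get(label) and rng.random() < p_replace:
            text = rng.choice(synonyms[label])
        candidates.append(text)
    target_index = labels.index(target)
    # Assumption: an augmented negative of the target joins as an extra
    # incorrect class, so the classifier must tell it apart from the target.
    if negatives.get(target) and rng.random() < p_replace:
        candidates.append(rng.choice(negatives[target]))
    return candidates, target_index


# Toy label set (made-up examples, not taken from the paper's datasets).
fixed = ["pour milk", "crack egg"]
syn = {"pour milk": ["add milk"], "crack egg": ["break egg"]}
neg = {"pour milk": ["pour water"]}
always, idx = build_candidate_labels(fixed, syn, neg, "pour milk",
                                     p_replace=1.0, rng=random.Random(0))
never, _ = build_candidate_labels(fixed, syn, neg, "pour milk",
                                  p_replace=0.0, rng=random.Random(0))
```

The candidate texts would then be encoded and scored against the video embedding in the auxiliary classification loss. Because the replacement is random, the model rarely sees the same combination of label strings twice.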

Taxonomy

Topics: Multimodal Machine Learning Applications