Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators

Harsh Lunia

arXiv:2407.14834 · cs.CV · July 23, 2024 · 1 citation

Open Access

TL;DR

This paper explores using a coordinated system of vision-language models and large language models to recognize actions in surveillance videos with limited temporal information, demonstrating promising results but highlighting the need for stronger temporal cues.

Contribution

It applies the Cola framework, in which an LLM coordinates multiple VLMs through natural language, to action recognition in surveillance videos, extending the framework's commonsense reasoning from still images to a few selected video frames.

Findings

01. LLMs can effectively coordinate VLMs for action recognition in videos.

02. The approach performs well with minimal temporal information.

03. Enhancing temporal signals could improve accuracy.

Abstract

Recent advancements have introduced multiple vision-language models (VLMs) demonstrating impressive commonsense reasoning across various domains. Despite their individual capabilities, the potential of synergizing these complementary VLMs remains underexplored. The Cola Framework addresses this by showcasing how a large language model (LLM) can efficiently coordinate multiple VLMs through natural language communication, leveraging their distinct strengths. We have verified this claim on the challenging A-OKVQA dataset, confirming the effectiveness of such coordination. Building on this, our study investigates whether the same methodology can be applied to surveillance videos for action recognition. Specifically, we explore if leveraging the combined knowledge base of VLMs and LLM can effectively deduce actions from a video when presented with only a few selectively important frames and…
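
The coordination idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the function names are hypothetical, evenly spaced sampling stands in for whatever frame selection the authors use, and the VLMs and the LLM are plain Python callables (stubs here) that take a frame index instead of an image.

```python
from typing import Callable, Dict, List


def pick_frame_indices(num_frames: int, k: int) -> List[int]:
    """Pick up to k evenly spaced frame indices (an assumed stand-in for frame selection)."""
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)
    if k == 1:
        return [num_frames // 2]
    step = (num_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]


def build_coordinator_prompt(
    question: str,
    options: List[str],
    vlm_outputs: Dict[int, Dict[str, str]],
) -> str:
    """Merge per-frame VLM outputs into one text prompt for the coordinating LLM."""
    lines = [f"Question: {question}", "Options: " + ", ".join(options)]
    for frame_idx in sorted(vlm_outputs):
        for name, text in vlm_outputs[frame_idx].items():
            lines.append(f"Frame {frame_idx} | {name}: {text}")
    lines.append("Answer with exactly one option.")
    return "\n".join(lines)


def recognize_action(
    num_frames: int,
    k: int,
    options: List[str],
    vlms: Dict[str, Callable[[int], str]],
    llm: Callable[[str], str],
) -> str:
    """Query every VLM on each selected frame, then let the LLM choose an action."""
    outputs = {
        idx: {name: vlm(idx) for name, vlm in vlms.items()}
        for idx in pick_frame_indices(num_frames, k)
    }
    prompt = build_coordinator_prompt(
        "What action is shown in the video?", options, outputs
    )
    return llm(prompt)


# Hypothetical stubs: a real system would call actual VLM and LLM models here.
stub_vlms: Dict[str, Callable[[int], str]] = {
    "captioner": lambda idx: f"caption for frame {idx}",
    "vqa": lambda idx: "a person walking",
}


def stub_llm(prompt: str) -> str:
    # Trivial rule standing in for LLM reasoning over the merged prompt.
    return "walking" if "walking" in prompt else "unknown"


answer = recognize_action(10, 4, ["walking", "running"], stub_vlms, stub_llm)
```

Passing the models in as callables keeps the sketch independent of any particular VLM or LLM library; swapping the stubs for real model calls would not change the coordination logic.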

Taxonomy

Topics: Anomaly Detection Techniques and Applications

Methods: COLA · Balanced Selection