Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators
Harsh Lunia

TL;DR
This paper explores using a coordinated system of vision-language models and large language models to recognize actions in surveillance videos with limited temporal information, demonstrating promising results but highlighting the need for stronger temporal cues.
Contribution
It applies the Cola framework, in which an LLM coordinates multiple VLMs through natural language, to action recognition in surveillance videos, extending an approach previously verified on image-based question answering (A-OKVQA) to video.
Findings
LLMs can effectively coordinate VLMs for action recognition in videos.
The approach shows promising results even when given only a few selected frames and minimal temporal information.
Enhancing temporal signals could improve accuracy.
Abstract
Recent advancements have introduced multiple vision-language models (VLMs) demonstrating impressive commonsense reasoning across various domains. Despite their individual capabilities, the potential of synergizing these complementary VLMs remains underexplored. The Cola Framework addresses this by showcasing how a large language model (LLM) can efficiently coordinate multiple VLMs through natural language communication, leveraging their distinct strengths. We have verified this claim on the challenging A-OKVQA dataset, confirming the effectiveness of such coordination. Building on this, our study investigates whether the same methodology can be applied to surveillance videos for action recognition. Specifically, we explore whether leveraging the combined knowledge base of the VLMs and the LLM can effectively deduce actions from a video when presented with only a few selectively important frames and…
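The coordination idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the VLMs and the LLM are modelled as plain Python callables, the names `select_frames`, `build_coordinator_prompt`, and `recognize_action` are hypothetical, and evenly spaced sampling is only a stand-in for however the authors choose their "selectively important" frames.

```python
from typing import Callable, Dict, List, Sequence

# Assumption: a VLM is any function (frame, question) -> text answer,
# and the coordinating LLM is any function prompt -> text answer.
VLM = Callable[[object, str], str]
LLM = Callable[[str], str]


def select_frames(num_frames: int, k: int) -> List[int]:
    """Pick up to k evenly spaced frame indices (a placeholder selection rule)."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    if k == 1:
        return [num_frames // 2]
    # Integer arithmetic keeps the first and last frame and avoids rounding issues.
    return [(i * (num_frames - 1)) // (k - 1) for i in range(k)]


def build_coordinator_prompt(
    question: str,
    frame_indices: Sequence[int],
    vlm_answers: Dict[str, Dict[int, str]],
) -> str:
    """Collect every VLM's per-frame answer into one natural-language prompt."""
    lines = [f"Question: {question}", "Observations from vision-language models:"]
    for idx in frame_indices:
        for name, answers in vlm_answers.items():
            lines.append(f"- frame {idx}, {name}: {answers[idx]}")
    lines.append("Based on these observations, name the action in the video.")
    return "\n".join(lines)


def recognize_action(
    frames: Sequence[object],
    vlms: Dict[str, VLM],
    llm: LLM,
    question: str = "What action is the person performing?",
    k: int = 4,
) -> str:
    """Query each VLM on a few frames, then let the LLM coordinate the answers."""
    indices = select_frames(len(frames), k)
    vlm_answers = {
        name: {i: vlm(frames[i], question) for i in indices}
        for name, vlm in vlms.items()
    }
    return llm(build_coordinator_prompt(question, indices, vlm_answers))
```

In this sketch the only temporal signal the LLM receives is the frame index attached to each observation, which mirrors the limited temporal information the summary mentions.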
Taxonomy
Topics: Anomaly Detection Techniques and Applications
Methods: COLA · Balanced Selection
