Language-driven Description Generation and Common Sense Reasoning for Video Action Recognition

Xiaodan Hu; Chuhang Zou; Suchen Wang; Jaechul Kim; Narendra Ahuja

arXiv:2506.16701 · cs.CV · June 23, 2025

Open Access

TL;DR

This paper presents a novel framework that leverages language-driven common sense priors to improve video action recognition, especially in cluttered and occluded scenes, by integrating scene description and reasoning with visual cues.

Contribution

The paper introduces an approach that combines language-based scene description with common sense reasoning to improve video action recognition.

Findings

01. Improved accuracy on the Action Genome and Charades datasets.

02. Effective integration of textual and visual cues for action recognition.

03. Enhanced understanding of occluded and cluttered scenes.

Abstract

Recent video action recognition methods have shown excellent performance by adapting large-scale pre-trained language-image models to the video domain. However, language models contain rich common sense priors - the scene contexts that humans use to constitute an understanding of objects, human-object interactions, and activities - that have not been fully exploited. In this paper, we introduce a framework incorporating language-driven common sense priors to identify cluttered video action sequences from monocular views that are often heavily occluded. We propose: (1) A video context summary component that generates candidate objects, activities, and the interactions between objects and activities; (2) A description generation module that describes the current scene given the context and infers subsequent activities, through auxiliary prompts and common sense reasoning; (3) A…
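
The first two components named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `ContextSummary` container, the `build_description_prompt` function, and the prompt wording are all hypothetical, chosen only to show how candidate objects, activities, and interactions might be folded into an auxiliary text prompt for a language model. No model is called here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ContextSummary:
    """Hypothetical container for the output of a video context summary step."""
    objects: List[str] = field(default_factory=list)
    activities: List[str] = field(default_factory=list)
    # Each interaction pairs a candidate object with a candidate activity.
    interactions: List[Tuple[str, str]] = field(default_factory=list)


def build_description_prompt(context: ContextSummary) -> str:
    """Assemble an illustrative auxiliary prompt from the context summary.

    The wording is an assumption; the abstract does not give the prompts.
    """
    pairs = "; ".join(f"{obj} <-> {act}" for obj, act in context.interactions)
    lines = [
        "Candidate objects: " + ", ".join(context.objects),
        "Candidate activities: " + ", ".join(context.activities),
        "Candidate interactions: " + pairs,
        "Describe the current scene using only the candidates above.",
        "Then use common sense to infer which activities are likely to follow.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    ctx = ContextSummary(
        objects=["cup", "table"],
        activities=["drinking"],
        interactions=[("cup", "drinking")],
    )
    print(build_description_prompt(ctx))
```

In a full system the returned string would be passed to a language model and its output combined with visual features; that stage is left out because the abstract is truncated before the third component is described.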

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Explainable Artificial Intelligence (XAI)