Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning

Caihua Liu; Xu Li; Wenjing Xue; Wei Tang; Xia Feng

arXiv:2502.13754·cs.CV·February 20, 2025

TL;DR

This paper introduces a dynamic graph transformer that captures complex object behaviors over time for improved video captioning, leveraging multi-scale temporal modeling and semantic-aware modules.

Contribution

The paper combines a multi-scale temporal modeling module with a visual-action semantic-aware module inside a graph transformer, yielding richer behavior representations for video captioning.

Findings

01 Significant performance improvements on the MSVD and MSR-VTT datasets.

02 Enhanced richness and accuracy of action representations.

03 Effective modeling of complex temporal dependencies.

Abstract

Existing video captioning methods provide only shallow or simplistic representations of object behaviors, resulting in superficial and ambiguous descriptions. However, object behavior is dynamic and complex. To comprehensively capture the essence of object behavior, we propose a dynamic action semantic-aware graph transformer. First, a multi-scale temporal modeling module is designed to flexibly learn long- and short-term latent action features. It not only acquires latent action features across time scales but also considers local latent action details, enhancing the coherence and sensitivity of latent action representations. Second, a visual-action semantic-aware module is proposed to adaptively capture semantic representations related to object behavior, enhancing the richness and accuracy of action representations. By harnessing the collaborative efforts of these two…
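
The abstract does not say how the multi-scale temporal modeling module is built, so the following is only a minimal sketch of the general idea of reading motion cues at several time scales. It assumes per-frame feature vectors are already available and uses plain frame differences as a stand-in for "latent action features". The function names, the stride values, and the mean-pooling step are all hypothetical choices for illustration, not the authors' design.

```python
from typing import List, Sequence

Vector = List[float]


def temporal_differences(frames: Sequence[Vector], stride: int) -> List[Vector]:
    """Feature differences between frames that are `stride` steps apart."""
    return [
        [b - a for a, b in zip(frames[t], frames[t + stride])]
        for t in range(len(frames) - stride)
    ]


def mean_pool(vectors: Sequence[Vector], dim: int) -> Vector:
    """Average a list of vectors; return zeros when the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def multi_scale_motion(frames: Sequence[Vector],
                       strides: Sequence[int] = (1, 2, 4)) -> Vector:
    """Concatenate mean-pooled frame differences over several strides.

    Small strides respond to short-term changes, large strides to
    long-term ones. The strides here are illustrative assumptions.
    """
    dim = len(frames[0])
    pooled: Vector = []
    for stride in strides:
        pooled.extend(mean_pool(temporal_differences(frames, stride), dim))
    return pooled


if __name__ == "__main__":
    # Five toy frames with two feature dimensions each.
    clip = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
    print(multi_scale_motion(clip))
```

In a learned model the fixed differences and averaging would typically be replaced by trainable layers (for example attention over windows of different lengths); the sketch only shows how short and long temporal spans can feed one joint representation.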

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Laplacian EigenMap · Multi-Head Attention · Attentive Walk-Aggregating Graph Neural Network · Position-Wise Feed-Forward Layer