Transformer with Controlled Attention for Synchronous Motion Captioning

Karim Radouane; Sylvie Ranwez; Julien Lagarde; Andon Tchechmedjiev

arXiv:2409.09177·cs.CV·September 17, 2024

TL;DR

This paper introduces a Transformer-based model with controlled attention for synchronous motion captioning. The model generates interpretable, time-aligned text synchronized with human motion sequences and reports state-of-the-art results on the KIT-ML and HumanML3D datasets.

Contribution

The paper proposes a method for controlling the self- and cross-attention distributions of a Transformer for synchronous motion captioning, improving the interpretability and temporal alignment of the generated text.

Findings

01. Achieved state-of-the-art performance on the KIT-ML and HumanML3D datasets.

02. Provided visualizations demonstrating synchronized motion and caption generation.

03. Introduced masking strategies to focus attention on important motion frames.
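The masking idea in finding 03 can be illustrated with a masked softmax over per-frame attention scores. This is a minimal sketch in plain Python, not the paper's implementation: the function name `masked_softmax`, the example scores, and the choice of which frames to keep are all hypothetical, and the paper's actual masking rules are not given on this page.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over frame scores in which masked-out frames get zero weight.

    scores: list of floats, one raw attention score per motion frame.
    mask:   list of bools, True where the frame may be attended to.
    (Both arguments are illustrative; they are not taken from the paper.)
    """
    if not any(mask):
        raise ValueError("mask must keep at least one frame")
    # Subtract the largest kept score for numerical stability.
    peak = max(s for s, keep in zip(scores, mask) if keep)
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: six frames, only frames 2..4 may receive attention.
scores = [2.0, 0.5, 1.0, 3.0, 1.0, 4.0]
mask = [False, False, True, True, True, False]
weights = masked_softmax(scores, mask)
```

Frame 5 has the highest raw score, yet it receives zero weight because the mask excludes it. All of the attention mass is redistributed over the allowed frames.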

Abstract

In this paper, we address a challenging task, synchronous motion captioning, that aims to generate a language description synchronized with human motion sequences. This task pertains to numerous applications, such as aligned sign language transcription, unsupervised action segmentation and temporal grounding. Our method introduces mechanisms to control self- and cross-attention distributions of the Transformer, allowing interpretability and time-aligned text generation. We achieve this through masking strategies and structuring losses that push the model to maximize attention only on the most important frames contributing to the generation of a motion word. These constraints aim to prevent undesired mixing of information in attention maps and to provide a monotonic attention distribution across tokens. Thus, the cross attentions of tokens are used for progressive text generation in…
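To make the notion of a monotonic attention distribution across tokens concrete, the sketch below scores how far a set of cross-attention rows departs from left-to-right temporal order, and reads a frame alignment off the same rows. It is a hedged illustration only: `attention_center`, `monotonic_penalty`, and `align_tokens` are hypothetical names, the hinge-style penalty is an assumption chosen for clarity, and the example rows are made up. The paper's structuring losses are not specified in this abstract.

```python
def attention_center(weights):
    """Expected frame index under one token's attention distribution."""
    return sum(i * w for i, w in enumerate(weights))


def monotonic_penalty(attention_rows):
    """Total backward movement between consecutive tokens' attention centers.

    Returns 0.0 when every token attends, on average, no earlier in time
    than the previous token. (Assumed penalty form, not the paper's loss.)
    """
    centers = [attention_center(row) for row in attention_rows]
    return sum(max(0.0, prev - nxt) for prev, nxt in zip(centers, centers[1:]))


def align_tokens(attention_rows):
    """Frame index receiving the most attention for each token."""
    return [max(range(len(row)), key=row.__getitem__) for row in attention_rows]


# Hypothetical attention rows: 3 generated tokens over 4 motion frames.
ordered = [
    [0.8, 0.2, 0.0, 0.0],
    [0.1, 0.7, 0.2, 0.0],
    [0.0, 0.0, 0.3, 0.7],
]
# The same rows in an order that jumps backward in time.
shuffled = [ordered[2], ordered[0], ordered[1]]

penalty_ordered = monotonic_penalty(ordered)
penalty_shuffled = monotonic_penalty(shuffled)
alignment = align_tokens(ordered)
```

The ordered rows incur no penalty, while the shuffled rows are penalized for the backward jump from the last frames to the first. A training loss built on a term like this would push attention toward the time-ordered pattern that makes per-token alignment readable.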

Code & Models

Repositories

rd20karim/synch-transformer (PyTorch, official)

Taxonomy

Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Image and Video Stabilization

Methods: Attention Is All You Need · Sparse Evolutionary Training · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection