Transformer with Controlled Attention for Synchronous Motion Captioning
Karim Radouane, Sylvie Ranwez, Julien Lagarde, Andon Tchechmedjiev

TL;DR
This paper introduces a Transformer-based model with controlled attention mechanisms for synchronous motion captioning. The model generates interpretable, time-aligned text synchronized with human motion sequences and reports state-of-the-art results on the KIT-ML and HumanML3D datasets.
Contribution
The paper proposes a novel attention control method within Transformers for synchronized motion captioning, enhancing interpretability and temporal alignment.
Findings
Achieved state-of-the-art performance on KIT-ML and HumanML3D datasets.
Provided visualizations demonstrating synchronized motion and caption generation.
Introduced masking strategies to focus attention on important motion frames.
Abstract
In this paper, we address a challenging task, synchronous motion captioning, which aims to generate a language description synchronized with human motion sequences. This task pertains to numerous applications, such as aligned sign language transcription, unsupervised action segmentation and temporal grounding. Our method introduces mechanisms to control self- and cross-attention distributions of the Transformer, allowing interpretability and time-aligned text generation. We achieve this through masking strategies and structuring losses that push the model to maximize attention only on the most important frames contributing to the generation of a motion word. These constraints aim to prevent undesired mixing of information in attention maps and to provide a monotonic attention distribution across tokens. Thus, the cross attentions of tokens are used for progressive text generation in…
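The abstract does not spell out the exact masks or losses, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a boolean frame mask applied before the softmax in cross-attention, and a simple hinge penalty that is zero when each token's expected attended frame does not move backwards in time. The function names, the mask layout, and the penalty form are all hypothetical.

```python
import numpy as np


def masked_cross_attention(scores, frame_mask):
    """Softmax over frames, with masked-out frames given zero weight.

    scores:     (num_tokens, num_frames) raw attention logits.
    frame_mask: (num_tokens, num_frames) booleans, True where a token
                may attend. Assumes every row has at least one True.
    """
    masked = np.where(frame_mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


def attention_centers(weights):
    """Expected frame index attended to by each token."""
    frames = np.arange(weights.shape[-1])
    return weights @ frames


def monotonicity_penalty(weights):
    """Hypothetical hinge penalty on backward jumps between tokens.

    Returns 0 when the attention centers are non-decreasing across
    successive tokens, and grows with each backward jump.
    """
    centers = attention_centers(weights)
    return np.maximum(0.0, centers[:-1] - centers[1:]).sum()


# Toy example: 3 tokens, 4 frames, each token limited to a 2-frame window.
scores = np.zeros((3, 4))
frame_mask = np.array([
    [True, True, False, False],
    [False, True, True, False],
    [False, False, True, True],
])
weights = masked_cross_attention(scores, frame_mask)
penalty_forward = monotonicity_penalty(weights)
penalty_reversed = monotonicity_penalty(weights[::-1])
```

In this toy setup the forward ordering incurs no penalty while the reversed ordering does, which is the kind of signal a structuring loss could use to encourage time-aligned generation.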
Taxonomy
Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Image and Video Stabilization
Methods: Attention Is All You Need · Sparse Evolutionary Training · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection
