Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling, Andrew Zisserman

TL;DR
This paper introduces SEA, a universal framework that aligns subtitles with sign language videos across multiple languages using pretrained models for segmentation and embedding, achieving state-of-the-art results efficiently.
Contribution
The paper presents a novel, language-agnostic method for subtitle alignment to sign language videos that outperforms existing approaches in accuracy and efficiency.
Findings
State-of-the-art alignment accuracy on four datasets
Alignment step runs on CPUs within a minute, even for hour-long episodes
Flexible adaptation to various resource levels
Abstract
The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large…
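The abstract describes the pipeline only at a high level, so the following is a minimal sketch of the final "align" stage, not the authors' implementation. It assumes the segmenter has already produced one (start, end) time per sign and that the embedding model has produced one vector per sign clip and one per subtitle text. It then uses a generic monotonic dynamic program (each sign is assigned to exactly one subtitle, in order, and every subtitle receives at least one sign) to maximize total cosine similarity. The function names, the scoring rule, and the "at least one sign per subtitle" constraint are all illustrative assumptions.

```python
import numpy as np


def cosine_similarity_matrix(text_emb, sign_emb):
    """Cosine similarity between every subtitle embedding and every sign embedding."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s = sign_emb / np.linalg.norm(sign_emb, axis=1, keepdims=True)
    return t @ s.T  # shape: (n_subtitles, n_signs)


def align_monotonic(sim):
    """Assign each sign to one subtitle, monotonically, maximizing total similarity.

    Hypothetical formulation (not taken from the paper): a path through the
    (subtitle, sign) grid that either stays on the same subtitle or advances
    to the next one at every sign.
    """
    n_subs, n_signs = sim.shape
    if n_signs < n_subs:
        raise ValueError("need at least one sign per subtitle")
    score = np.full((n_subs, n_signs), -np.inf)
    back = np.zeros((n_subs, n_signs), dtype=int)  # 1 = advanced from previous subtitle
    score[0, 0] = sim[0, 0]
    for j in range(1, n_signs):
        for i in range(min(j + 1, n_subs)):
            stay = score[i, j - 1]
            advance = score[i - 1, j - 1] if i > 0 else -np.inf
            if advance > stay:
                score[i, j] = advance + sim[i, j]
                back[i, j] = 1
            else:
                score[i, j] = stay + sim[i, j]
    # Backtrack from the last subtitle at the last sign.
    assignment = np.empty(n_signs, dtype=int)
    i = n_subs - 1
    for j in range(n_signs - 1, -1, -1):
        assignment[j] = i
        i -= back[i, j]
    return assignment


def subtitle_spans(assignment, sign_times):
    """Turn a sign-to-subtitle assignment into one (start, end) span per subtitle."""
    spans = []
    for sub in range(int(assignment.max()) + 1):
        idx = np.flatnonzero(assignment == sub)
        spans.append((sign_times[idx[0]][0], sign_times[idx[-1]][1]))
    return spans


# Toy example: 2 subtitles, 5 segmented signs (similarities are made up).
sim = np.array([[0.9, 0.8, 0.1, 0.0, 0.1],
                [0.1, 0.2, 0.7, 0.9, 0.2]])
sign_times = [(0.0, 0.5), (0.6, 1.0), (1.2, 1.8), (1.9, 2.4), (2.5, 3.0)]
assignment = align_monotonic(sim)
spans = subtitle_spans(assignment, sign_times)
```

The dynamic program fills a table with one cell per (subtitle, sign) pair, so its cost grows with the product of the two counts and needs no GPU. That is in keeping with the abstract's description of the alignment step as lightweight, although the actual procedure used by SEA may differ from this sketch.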
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Human Motion and Animation
