Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

Zifan Jiang; Youngjoon Jang; Liliane Momeni; Gül Varol; Sarah Ebling; Andrew Zisserman

arXiv:2512.08094 · cs.CL · December 10, 2025

Open Access

TL;DR

This paper introduces SEA, a universal framework that aligns subtitles with sign language videos across multiple languages using pretrained models for segmentation and embedding, achieving state-of-the-art results efficiently.

Contribution

The paper presents a novel, language-agnostic method for subtitle alignment to sign language videos that outperforms existing approaches in accuracy and efficiency.

Findings

01 State-of-the-art alignment accuracy on four datasets

02 Alignment runs on CPUs within a minute, even for hour-long episodes

03 Flexible adaptation to various resource levels

Abstract

The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large…
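The abstract does not spell out the dynamic programming objective, so the following is only a minimal sketch of one common way such an alignment step can be set up, not the authors' procedure. It assumes a precomputed similarity matrix `sim[i][j]` between sign segment `i` and subtitle `j` (for example, cosine similarity in the shared latent space), that segments and subtitles are both in temporal order, that every segment is assigned to exactly one subtitle, and that every subtitle receives at least one segment. The function names `align_subtitles` and `spans_to_times` are hypothetical.

```python
from typing import List, Tuple


def align_subtitles(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """Monotonic alignment by dynamic programming (illustrative assumption).

    sim[i][j] is the similarity between sign segment i and subtitle j.
    Returns, for each subtitle, the inclusive (first, last) segment indices
    of the contiguous run assigned to it, maximizing total similarity.
    """
    n, m = len(sim), len(sim[0])
    if n < m:
        raise ValueError("need at least one segment per subtitle")
    neg = float("-inf")
    # dp[i][j]: best total score with segments 0..i assigned and segment i on subtitle j.
    dp = [[neg] * m for _ in range(n)]
    # back[i][j] == 1 means segment i is the first segment of subtitle j.
    back = [[0] * m for _ in range(n)]
    dp[0][0] = sim[0][0]
    for i in range(1, n):
        for j in range(min(i + 1, m)):
            stay = dp[i - 1][j]                            # same subtitle as segment i-1
            advance = dp[i - 1][j - 1] if j > 0 else neg   # move on to the next subtitle
            if advance > stay:
                dp[i][j] = advance + sim[i][j]
                back[i][j] = 1
            else:
                dp[i][j] = stay + sim[i][j]
    # Backtrack from the last segment, which must belong to the last subtitle.
    spans = [[0, 0] for _ in range(m)]
    j = m - 1
    spans[j][1] = n - 1
    for i in range(n - 1, 0, -1):
        if back[i][j] == 1:
            spans[j][0] = i
            j -= 1
            spans[j][1] = i - 1
    return [(first, last) for first, last in spans]


def spans_to_times(
    spans: List[Tuple[int, int]], seg_times: List[Tuple[float, float]]
) -> List[Tuple[float, float]]:
    """Turn segment-index spans into (start, end) timestamps in seconds."""
    return [(seg_times[first][0], seg_times[last][1]) for first, last in spans]


# Toy example: four sign segments, two subtitles.
toy_sim = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
toy_times = [(0.0, 0.5), (0.6, 1.1), (1.4, 2.0), (2.1, 2.6)]
toy_spans = align_subtitles(toy_sim)
print(toy_spans)                             # [(0, 1), (2, 3)]
print(spans_to_times(toy_spans, toy_times))  # [(0.0, 1.1), (1.4, 2.6)]
```

The table has one cell per (segment, subtitle) pair and each cell is filled in constant time, so the cost grows with the product of the two counts. That is in keeping with the abstract's description of the alignment step as lightweight and CPU-friendly, although the formulation above remains an assumption.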

Taxonomy

Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Human Motion and Animation