SignCLIP: Connecting Text and Sign Language by Contrastive Learning

Zifan Jiang; Gerard Sant; Amit Moryossef; Mathias Müller; Rico Sennrich; Sarah Ebling

arXiv:2407.01264 · cs.CL · October 8, 2024

TL;DR

SignCLIP adapts CLIP to embed spoken language text and sign language videos into a shared space, enabling effective cross-modal sign language understanding without task-specific training.
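
As a rough illustration of why a shared embedding space allows recognition without task-specific training: once a sign video and a set of candidate texts are encoded into the same space, a label can be picked by nearest-neighbour search under cosine similarity. This is a minimal, hypothetical sketch in plain Python; the toy 3-dimensional vectors stand in for real encoder outputs, and the names `cosine` and `zero_shot_label` are illustrative, not part of the SignCLIP code.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_label(video_emb: Sequence[float],
                    text_embs: Dict[str, Sequence[float]]) -> str:
    """Return the candidate label whose text embedding is closest to the video."""
    return max(text_embs, key=lambda label: cosine(video_emb, text_embs[label]))


# Hypothetical toy embeddings; a real system would get these from its
# text encoder and video encoder respectively.
text_embs = {
    "house": [0.9, 0.1, 0.0],
    "tree": [0.0, 1.0, 0.2],
    "water": [0.1, 0.0, 1.0],
}
video_emb = [0.8, 0.2, 0.1]
print(zero_shot_label(video_emb, text_embs))  # house
```

Adding a new label only requires encoding one more text, with no retraining, which is the property the summary above refers to.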

Contribution

It introduces a scalable, multilingual pretraining approach for sign language processing using contrastive learning on large video-text datasets.

Findings

01 High accuracy in in-domain text-video retrieval
02 Competitive performance on out-of-domain sign recognition
03 Linguistic insights from latent space analysis
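
Retrieval accuracy of this kind is often reported as Recall@k, the fraction of queries whose correct match appears among the top k results; the summary does not name its metric, so this is an assumption. A minimal sketch for the text-to-video direction, using a hypothetical similarity matrix in which text i is paired with video i:

```python
from typing import List


def recall_at_k(sim: List[List[float]], k: int) -> float:
    """Text-to-video Recall@k, assuming text i's correct video is video i."""
    hits = 0
    for i, row in enumerate(sim):
        # Indices of the k highest-scoring videos for text i.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(sim)


# Hypothetical scores: rows are texts, columns are videos.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.1, 0.3, 0.7],
]
print(recall_at_k(sim, 1))  # 2 of 3 texts rank their video first
print(recall_at_k(sim, 2))  # 1.0
```

Video-to-text Recall@k is the same computation applied to the transposed matrix.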

Abstract

We present SignCLIP, which re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs, without directly optimizing for a specific task or sign language which is often of limited size. We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of ~500 thousand video clips in up to 44 sign languages, and evaluate it with various downstream datasets. SignCLIP discerns in-domain signing with notable text-to-video/video-to-text retrieval accuracy. It also performs competitively for out-of-domain downstream tasks such as isolated sign language recognition upon essential few-shot…
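
CLIP-style pretraining is usually described by a symmetric contrastive (InfoNCE) loss over a batch of matched pairs: each text should score highest against its own video and vice versa, with the other items in the batch acting as negatives. The following is a minimal sketch of that objective in plain Python, assuming pre-computed embeddings and a fixed temperature of 0.07 (CLIP initializes its learnable temperature at this value). It is not the training code from the linked fairseq repository, and SignCLIP's exact loss settings are not given in this abstract.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row: Vector) -> Vector:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_loss(text_embs: List[Vector], video_embs: List[Vector],
              temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss; (text_i, video_i) is the positive pair.

    Assumption: a fixed temperature (CLIP learns it; 0.07 is its initial value).
    """
    t = [l2_normalize(x) for x in text_embs]
    v = [l2_normalize(x) for x in video_embs]
    n = len(t)
    # logits[i][j] = cos(text_i, video_j) / temperature
    logits = [[sum(a * b for a, b in zip(t[i], v[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Text-to-video: cross-entropy of each row against the diagonal target.
    loss_t2v = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Video-to-text: the same over columns (the transposed matrix).
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_v2t = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_t2v + loss_v2t) / 2


texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_videos = [[1.0, 0.0], [0.0, 1.0]]
shuffled_videos = [[0.0, 1.0], [1.0, 0.0]]
print(clip_loss(texts, aligned_videos))   # close to 0
print(clip_loss(texts, shuffled_videos))  # roughly 1 / 0.07, about 14.3
```

With an aligned batch the diagonal dominates and the loss is near zero; mismatching the pairs raises it sharply, and minimizing it is what pulls matching text and video embeddings together in the shared space.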

Code & Models

Repositories

J22Melody/fairseq (PyTorch, official)

Videos

SignCLIP: Connecting Text and Sign Language by Contrastive Learning (Underline)

Taxonomy

Topics: Hearing Impairment and Communication · Hand Gesture Recognition Systems · Linguistics and Terminology Studies

Methods: Contrastive Language-Image Pre-training