SLVideo: A Sign Language Video Moment Retrieval Framework
Gonçalo Vinagre Martins, João Magalhães, Afonso Quinaz, Carla Viegas, Sofia Cavaco

TL;DR
SLVideo is a novel sign language video retrieval system that uses facial and hand embeddings, enabling text-based search and sign similarity queries in Portuguese Sign Language videos.
Contribution
It introduces a comprehensive retrieval framework that incorporates facial expressions and a thesaurus for sign similarity, advancing sign language video search technology.
Findings
Promising zero-shot retrieval performance
Effective use of CLIP embeddings for sign language
Supports sign similarity and annotation editing
Abstract
SLVideo is a video moment retrieval system for Sign Language videos that incorporates facial expressions, addressing a gap in existing technology. The system extracts embedding representations of the hand and face signs from video frames to capture the signs in their entirety, enabling users to search for a specific sign language video segment with text queries. A collection of eight hours of annotated Portuguese Sign Language videos is used as the dataset, and a CLIP model is used to generate the embeddings. The initial results are promising in a zero-shot setting. In addition, SLVideo incorporates a thesaurus that lets users search for signs similar to those retrieved, using the video segment embeddings, and also supports the editing and creation of sign language video annotations. Project web page: https://novasearch.github.io/SLVideo/
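The retrieval idea in the abstract can be illustrated with a minimal sketch. It assumes each video segment already has a single embedding vector and that a text query has been embedded into the same space (as a CLIP-style model would allow). The function names, segment ids, and the tiny 3-dimensional vectors are hypothetical placeholders, and cosine similarity is an assumed scoring choice. The abstract does not specify how hand and face embeddings are combined or scored, so this is not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_segments(query_vec, segment_vecs, top_k=3):
    """Rank segments by similarity to a query embedding (text-based search)."""
    scored = [(seg_id, cosine(query_vec, vec)) for seg_id, vec in segment_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def similar_signs(seg_id, segment_vecs, top_k=3):
    """Thesaurus-style lookup: segments closest to an already retrieved one."""
    others = {k: v for k, v in segment_vecs.items() if k != seg_id}
    return rank_segments(segment_vecs[seg_id], others, top_k)

# Toy stand-ins for real segment embeddings (hypothetical values).
segments = {
    "seg_a": [1.0, 0.0, 0.0],
    "seg_b": [0.9, 0.1, 0.0],
    "seg_c": [0.0, 1.0, 0.0],
}
# Toy stand-in for an embedded text query (hypothetical values).
query = [0.1, 0.9, 0.0]

text_hits = rank_segments(query, segments, top_k=2)
neighbours = similar_signs("seg_a", segments, top_k=2)
```

In this sketch the same ranking routine serves both use cases: a text query embedding is compared against every segment, and the similar-sign lookup reuses a segment's own embedding as the query while excluding that segment from the candidates.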
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training · Focus
