Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media
Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Josef van Genabith

TL;DR
This paper presents an automated, VLM-assisted pipeline for sign language data collection and annotation from social media, significantly reducing manual effort and enabling scalable dataset creation for multiple languages.
Contribution
It introduces the first automated framework using Vision Language Models for sign language dataset annotation and filtering from social media videos.
Findings
Created the TikTok-SL-8 dataset, covering eight sign languages.
Evaluated two SLT models on automatically filtered data.
Demonstrated the pipeline's effectiveness in reducing manual annotation effort.
Abstract
Most existing sign language translation (SLT) datasets are limited in scale, lack multilingual coverage, and are costly to curate due to their reliance on expert annotation and controlled recording setups. Recently, Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. Despite these advancements, their potential remains untapped in the context of sign language dataset acquisition. To bridge this gap, we introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and to the already curated YouTube-SL-25 dataset in German Sign Language for the purpose of additional evaluation. Our VLM-based pipeline includes face visibility detection, sign activity recognition, a text…
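The staged filtering described in the abstract can be pictured as a chain of yes/no checks applied to each video. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Video` record, the `answers` dictionary (standing in for VLM responses), and the `run_filters` helper are all hypothetical names. Only the two fully named stages, face visibility and sign activity, are modelled; the stages cut off in the abstract above are not sketched.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class Video:
    video_id: str
    # Hypothetical per-video answers, e.g. obtained by prompting a VLM
    # on sampled frames. The real pipeline's outputs may look different.
    answers: Dict[str, bool] = field(default_factory=dict)


# A check is a (stage name, predicate) pair.
Check = Tuple[str, Callable[[Video], bool]]


def run_filters(
    videos: Iterable[Video], checks: List[Check]
) -> Tuple[List[Video], Dict[str, str]]:
    """Keep videos that pass every check; record the first failed stage otherwise."""
    kept: List[Video] = []
    rejected: Dict[str, str] = {}
    for video in videos:
        # Stages run in order, and the first failing stage is recorded.
        failed = next((name for name, check in checks if not check(video)), None)
        if failed is None:
            kept.append(video)
        else:
            rejected[video.video_id] = failed
    return kept, rejected


# Assumed stage order; a missing answer is treated as a failure.
CHECKS: List[Check] = [
    ("face_visible", lambda v: v.answers.get("face_visible", False)),
    ("sign_activity", lambda v: v.answers.get("sign_activity", False)),
]
```

Recording the first failed stage for each rejected video keeps the filtering auditable, which matters when the automatic decisions are later spot-checked by hand.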
