SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages
Sen Fang, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas

TL;DR
SignVerse-2M is a large-scale, pose-native multilingual sign language dataset with over two million clips across 55+ languages, designed for open-world recognition and pose-driven sign language modeling.
Contribution
SignVerse-2M introduces a pose-native dataset built from real-world videos, intended to support sign language recognition and translation in open-world settings.
Findings
The dataset covers 55+ sign languages with over two million clips.
A unified preprocessing pipeline converts videos into pose sequences.
A baseline transformer model demonstrates the dataset's utility.
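To make the idea of a pose sequence concrete, here is a minimal sketch of what one clip might look like after a video-to-pose step. It is not the authors' pipeline: the (x, y, confidence) keypoint layout, the function name normalize_pose_sequence, and the min_conf threshold are all assumptions chosen for illustration. The keypoint detector itself (for example DWPose, which the abstract mentions) is assumed to have already run.

```python
from typing import List, Optional, Tuple

# Assumed layout: one keypoint is (x_pixels, y_pixels, confidence).
Keypoint = Tuple[float, float, float]
# A normalized keypoint is (x, y) in [0, 1], or None when treated as missing.
NormPoint = Optional[Tuple[float, float]]


def normalize_pose_sequence(
    frames: List[List[Keypoint]],
    width: int,
    height: int,
    min_conf: float = 0.3,  # hypothetical threshold, not taken from the paper
) -> List[List[NormPoint]]:
    """Scale pixel keypoints by the frame size and mask low-confidence ones."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    normalized: List[List[NormPoint]] = []
    for frame in frames:
        points: List[NormPoint] = []
        for x, y, conf in frame:
            if conf < min_conf:
                points.append(None)  # keypoint treated as missing
            else:
                points.append((x / width, y / height))
        normalized.append(points)
    return normalized


# Toy clip: two frames, two keypoints each, from a 640x480 video.
clip = [
    [(320.0, 240.0, 0.9), (10.0, 20.0, 0.1)],
    [(640.0, 480.0, 0.8), (160.0, 120.0, 0.5)],
]
pose_sequence = normalize_pose_sequence(clip, width=640, height=480)
print(pose_sequence)
# [[(0.5, 0.5), None], [(1.0, 1.0), (0.25, 0.25)]]
```

Dividing by the frame size removes the dependence on video resolution, which is one plausible reason a pose representation can be less sensitive to recording conditions than raw RGB frames.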
Abstract
Existing large-scale sign language resources typically provide supervision only at the level of raw video-text alignment and are often produced in laboratory settings. While such resources are important for semantic understanding, they do not directly provide a unified interface for open-world recognition and translation, or for modern pose-driven sign language video generation frameworks: 1. RGB-based pretrained recognition models depend heavily on fixed backgrounds or clothing conditions during recording, and are less robust in open-world settings than style-agnostic pose-processing models. 2. Recent pose-guided image/video generation models mostly use a unified keypoint representation such as DWPose as their control interface. At present, the sign language field still lacks a data resource that can directly interface with this modern pose-native paradigm while also targeting…
