Improving Continuous Sign Language Recognition with Adapted Image Models
Lianyu Hu, Tongkai Shi, Liqing Gao, Zekang Liu, Wei Feng

TL;DR
This paper introduces AdaptSign, a lightweight adaptation strategy for large vision-language models like CLIP, enabling efficient and effective continuous sign language recognition while preserving pretraining knowledge.
Contribution
AdaptSign keeps the CLIP visual backbone frozen and adds a set of lightweight learnable modules for spatial and temporal modeling, so the model can be adapted to CSLR efficiently without overwriting its pretrained knowledge.
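The frozen-backbone-plus-learnable-module pattern described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: `FrozenBackbone`, `TemporalAdapter`, and `encode_video` are hypothetical names, the backbone is a toy linear projection standing in for CLIP's image encoder, and the 3-tap residual temporal filter with zero initialization is an assumption chosen only to show where the trainable parameters live.

```python
import random


class FrozenBackbone:
    """Hypothetical stand-in for a pretrained frame-wise encoder such as
    CLIP's image tower. Its weights are created once and never updated."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
                       for _ in range(out_dim)]

    def __call__(self, frame):
        # One linear projection per frame; no temporal context here.
        return [sum(w * x for w, x in zip(row, frame)) for row in self.weight]


class TemporalAdapter:
    """Hypothetical learnable module: a 3-tap temporal filter added residually.
    Zero initialization (an assumption, not taken from the paper) makes the
    adapter an identity mapping before any training."""

    def __init__(self):
        self.kernel = [0.0, 0.0, 0.0]  # the only trainable parameters

    def __call__(self, feats):
        last = len(feats) - 1
        k0, k1, k2 = self.kernel
        out = []
        for t, cur in enumerate(feats):
            prev = feats[max(t - 1, 0)]    # repeat the border frame as padding
            nxt = feats[min(t + 1, last)]
            out.append([c + k0 * p + k1 * c + k2 * n
                        for p, c, n in zip(prev, cur, nxt)])
        return out


def encode_video(frames, backbone, adapter):
    """Frozen per-frame features followed by the learnable temporal module."""
    feats = [backbone(f) for f in frames]
    return adapter(feats)


backbone = FrozenBackbone(in_dim=4, out_dim=3)
adapter = TemporalAdapter()
frames = [[float(t + i) for i in range(4)] for t in range(5)]  # 5 toy frames

# Parameter accounting: only the adapter would be handed to an optimizer.
frozen_params = sum(len(row) for row in backbone.weight)
trainable_params = len(adapter.kernel)
```

In this sketch an optimizer would only ever touch `adapter.kernel`, so the backbone's features stay exactly as pretrained while the added module learns how to combine neighbouring frames. In a deep learning framework the same split is usually expressed by disabling gradients on the backbone's parameters.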
Findings
AdaptSign outperforms existing CSLR methods on multiple benchmarks.
The additional modules add only 3.2% extra computation.
Visualizations show effective focus on informative regions and trajectories.
Abstract
The growth of web-scale weakly labelled image-text pairs has greatly facilitated the development of large-scale vision-language models (e.g., CLIP), which have shown impressive generalization performance over a series of downstream tasks. However, their massive model size and the scarcity of available data limit the feasibility of fine-tuning the whole model on downstream tasks. Besides, full fine-tuning tends to forget the generic essential knowledge acquired in the pretraining stage and to overfit the downstream data. To enable high efficiency when adapting these large vision-language models (e.g., CLIP) to perform continuous sign language recognition (CSLR) while preserving their generalizability, we propose a novel strategy (AdaptSign). Specifically, CLIP is adopted as the visual backbone to extract frame-wise features whose parameters are fixed, and a set of learnable…
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Human Pose and Action Recognition
Methods: Sparse Evolutionary Training · Circular Smooth Label · Contrastive Language-Image Pre-training
