SignBind-LLM: Multi-Stage Modality Fusion for Sign Language Translation
Marshall Thomas, Edward Fish, Richard Bowden

TL;DR
SignBind-LLM introduces a modular multi-stage approach for sign language translation, employing specialized predictors for different modalities and a transformer-based fusion to improve accuracy and handle asynchronous cues, setting new state-of-the-art results.
Contribution
The paper presents a novel modular framework that separately decodes multiple sign language modalities and fuses them with a transformer before language modeling, improving translation quality.
Findings
Achieved new state-of-the-art BLEU-4 scores on multiple datasets.
Demonstrated improved letter accuracy and translation fidelity.
Validated the effectiveness of modality-specific predictors and fusion strategy.
| Model | Modality (Pose / RGB) | How2Sign test BLEU-1 | How2Sign test BLEU-4 | How2Sign test ROUGE |
|---|---|---|---|---|
| GloFE-VN [37] | ✓ | 14.9 | 2.2 | 12.6 |
| MSLU [71] | ✓ ✓ | 20.1 | 2.4 | 17.2 |
| SLT-IV [58] | ✓ | 34.0 | 8.0 | - |
| C2RL [9] | ✓ | 29.1 | 9.4 | 27.0 |
| FLa-LLM [10] | ✓ | 29.8 | 9.7 | - |
| YouTube-ASL [61] | ✓ | 37.8 | 12.4 | - |
| SignMusketeers [24] | ✓ | 41.5 | 14.3 | - |
| Uni-Sign [35] | ✓ ✓ | 40.2 | 14.9 | 36.0 |
| Geo-Sign [14] | ✓ | 40.8 | 15.1 | 35.4 |
| SSVP-SLT [49] | ✓ | 43.2 | 15.5 | 38.4 |
| SignBind-LLM (Ours) | ✓ | 49.4 | 22.1 | 41.2 |
| Model Variant | L-Acc. (%) | B-4 |
|---|---|---|
| W/o Lipreading | 54.3 | 13.6 |
| W/o Fingerspelling | 33.7 | 18.9 |
| W/o Fusion | 49.8 | 8.6 |
| W/o Sequencer | 51.2 | 14.3 |
| W/o LLM | 58.8 | 7.8 |
| Full Model | 73.2 | 22.1 |
| Model | Pho.-Acc. (%) | B-4 |
|---|---|---|
| Continuous Sign | - | 6.3 |
| Lipreading | 65.6 | - |
| Model | Params (B) | L-Acc. (%) | B-4 |
|---|---|---|---|
| Llama 3.2-1B | 1.0 | 60.9 | 12.1 |
| Llama 3.2-3B | 3.0 | 64.8 | 16.4 |
| Llama 2-7B | 7.0 | 73.2 | 22.1 |
| Variant | L-Acc. (%) | B-4 |
|---|---|---|
| No shift | 68.3 | 17.2 |
| | 62.8 | 12.4 |
| | 70.6 | 18.1 |
| | 55.7 | 7.8 |
| | 64.9 | 15.6 |
| Learned alignment | 73.2 | 22.1 |
| Fusion Method | BLEU-4 |
|---|---|
| Concatenation + MLP | 12.4 |
| Cross-Attention Fusion | 22.3 |
| Gated Fusion | 22.1 |
| Method | YT-ASL | H2S | B-4 |
|---|---|---|---|
| Zero-Shot Transfer | | | |
| YT-ASL (np) [61] | | | 1.41 |
| YT-ASL (pt) [61] | | | 3.95 |
| SignBind-LLM | | | 8.3 |
| H2S Only | | | |
| Álvarez et al. [11] | | | 2.21 |
| GloFE-VN [37] | | | 2.24 |
| YT-ASL (np) [61] | | | 0.86 |
| YT-ASL (pt) [61] | | | 1.22 |
| Tarrés et al. [58] | | | 8.03 |
| SignBind-LLM | | | 13.7 |
| Joint Training | | | |
| YT-ASL (np) [61] | | | 5.60 |
| YT-ASL (pt) [61] | | | 11.89 |
| Staged Pre-train Fine-tune | | | |
| YT-ASL (np) [61] | | | 6.26 |
| YT-ASL (pt) [61] | | | 12.39 |
| SignBind-LLM | | | 22.1 |
| Original English Sentence | Generated Pseudo-Gloss |
|---|---|
| So here we’ve got the startings of our bon fire. | HERE WE START OUR FIRE |
| We’re going to measure it and there you can see we have it measured. | GO MEASURE IT YOU SEE WE HAVE IT MEASURED WE |
| In my case I work more from home, and I work more from the college here that I cover, than I do actually at the office. | CASE I WORK MORE HOME I WORK MORE COLLEGE HERE I COVER I ACTUALLY OFFICE |
| I have a few different ones here. | HAVE DIFFERENT HERE |
| I have here four different travel cases for your rat. | HAVE HERE FOUR TRAVEL CASE YOUR RAT I |
| Original Label | Character Sequence |
|---|---|
| Bills | B I L L S |
| political capital | P O L I T I C A L C A P I T A L |
| april | A P R I L |
| laurene simms | L A U R E N E S I M M S |
| modalities | M O D A L I T I E S |
| Pseudo-Gloss | Phoneme Sequence |
|---|---|
| HAVE MY OVER FLOW | hh ae v m ay ow v er f l ow |
| JUG COOLANT | jh ah g k uw l ah n t |
| HAVE FEW CLEAN HOW | hh ae v f y uw k l iy n hh aw |
| OUR TOOLS FEW | aw er t uw l z f y uw |
| HAVE DIFFERENT HERE | hh ae v d ih f er ah n t hh iy r |
| GO MEASURE IT | g ow m eh zh er ih t |
| YOU SEE WE HAVE | y uw s iy w iy hh ae v |
| IT MEASURED WE | ih t m eh zh er d w iy |
| Method | VE Name | VE Params (M) | LM Name | LM Params (M) | Total (M) | B-4 |
|---|---|---|---|---|---|---|
| MSLU [71] | EffNet | 5.3 | mT5-Base | 582.4 | 587.7 | 2.4 |
| SLRT [7] | EffNet | 5.3 | Transformer | 30 | 35.3 | - |
| GASLT [66] | I3D | 13.0 | Transformer | 30 | 43.0 | - |
| GFSLT-VLP [70] | ResNet18 | 11.7 | mBart | 680 | 691.7 | - |
| Sign2GPT [63] | DinoV2 | 21.0 | XGLM | 1732.9 | 1753.9 | - |
| SignLLM [21] | ResNet18 | 11.7 | LLaMA-7B | 6738.4 | 6750.1 | - |
| C2RL [9] | ResNet18 | 11.7 | mBart | 680 | 691.7 | 9.4 |
| FLa-LLM [10] | ResNet18 | 11.7 | mBart | 680 | 691.7 | 9.7 |
| Uni-Sign [35] | EffNet+GCN | 9.7 | mT5-Base | 582.4 | 592.1 | 14.9 |
| Geo-Sign [14] | GCN+Geo+Attn | 6.7 | mT5-Base | 582.4 | 589.1 | 15.1 |
| Our Component Breakdown: | | | | | | |
| Continuous Sign Expert | DinoV2 | 101.1 | - | - | 101.1 | - |
| Fingerspelling Expert | DinoV2 (shared) | 0 (shared) | - | - | 0 | - |
| Lipreading Expert | ViT | 62.5 | - | - | 62.5 | - |
| Fusion Transformer | Attn | 33.9 | - | - | 33.9 | - |
| SignBind-LLM (Ours) | DinoV2+ViT | 197.5 | LLaMA-7B | 6738.4 | 6935.9 | 22.1 |
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW [31] |
| Betas (β1, β2) | 0.9, 0.98 |
| Learning Rate Schedule | Cosine Annealing |
| Initial LR | |
| Minimum LR | |
| Warmup Steps | 1,000 |
| Weight Decay | 0.01 (exclude biases, layer norms) |
| Gradient Clipping | Max norm = 1.0 |
| Dropout | 0.1 (all projections & attention) |
| Batch Size | 8 clips |
| Sequence Length | Variable (16–512 frames) |
| Epochs (Lipreading) | 100 |
| Epochs (Sign, Fingerspelling) | 30 |
| Loss Function | CTC |
| Hyperparameter | Value |
|---|---|
| Base Model | LLaMA-2-7B-hf |
| Adaptation Method | LoRA [27] |
| LoRA Rank | 16 |
| LoRA Scaling | 32 () |
| LoRA Dropout | 0.05 |
| Target Modules | (attn) |
| | (MLP) |
| Optimizer | AdamW |
| Betas (β1, β2) | 0.9, 0.999 |
| Learning Rate | |
| Warmup Ratio | 0.03 |
| Gradient Clipping | Max norm = 1.0 |
| Batch Size | 1 |
| Gradient Accumulation | 4 steps (effective batch = 4) |
| Sequence Length | 512 tokens |
| Epochs | 10 |
| Stage | Time | Peak Memory |
|---|---|---|
| Expert Pre-training | 72 hours | 68 GB |
| Fusion Training | 72 hours | 43 GB |
| LLM Fine-tuning | 24 hours | 78 GB |
| Total Training | 240 hours | 78 GB |
| Modality / Component | Output / Transcription |
|---|---|
| Continuous Sign | HAND WALL TREE FEEL WALL LOW YOU WHAT ALL LOOK |
| Fingerspelling | — |
| Lip Reading (Phonemes) | t uw ey n m ao l n s aa n y ao r t ow s ay d ao l y ae v t uw d uw ih z s eh n t er y r sh ow er z b ay l uh k ih ng ah p dh ah m n t ah m b ae k w eh r y uw k ey m f r ah |
| Fused | GAIN BALANCE ON TOE YOU HAVE CENTER YOU SHOULDERS LOOK UP MOUNTAIN YOU CAME FROM WHERE |
| Our Method (Combined) | So to gain that balance on your toe, all you have to do is center your shoulders by looking to the mountain. This is where you came from. |
| Reference | To gain balance on your toe side, all you have to do is center your shoulders by looking up the mountain, back where you came from. |
| Continuous Sign | POOR FLY HOUSE YOU ME DOOR |
| Fingerspelling | — |
| Lip Reading (Phonemes) | r ay t dh m d ao z t n dh eh m g ow t dh ah s t r b ay m aa z s t ae ao |
| Fused | RIGHT DOWN GO STORE BUY ME |
| Our Method (Combined) | So write them down and then go to the store and buy them for me. |
| Reference | Write them down, then go to the store. |
| Continuous Sign | WATCH MAKE |
| Fingerspelling | X P |
| Lip Reading (Phonemes) | s m n z w aa r t uw dh t ay t s ah m jh iy n s aa r n aa t t ay t ih n ah f |
| Fused | SOME TIGHT SOME NOT TIGHT |
| Our Method (Combined) | Some of them are too tight, some are not tight enough. |
| Reference | Some jeans are too tight, some jeans are not tight enough. |
| Continuous Sign | I WORK EYE BLUE |
| Fingerspelling | — |
| Lip Reading (Phonemes) | t d ey m iy r g oo ih g t eh w er k aa n z t r eh ih ng n d s r eh th p ih ng m ah l ow r b aa d iy |
| Fused | TODAY WE WORK LOWER HOUSE |
| Our Method (Combined) | Today we’re working on the lower part of the house, the basement. |
| Reference | Today we’re going to work on stretching and strengthening the lower body. |
| Continuous Sign | FRIEND BAG HAT YELLOW ONE WHERE |
| Fingerspelling | — |
| Lip Reading (Phonemes) | k dh ih s aa b v s l k ae ao l ow b iy y uw sh d ae z ih w ao r m ah p t uw l w eh n y eh r f er s t ih n ih k ah l iy y ah z ng ah b g |
| Fused | CAN USED AS WARM UP TOOL WHEN YOU FIRST INITIALLY USING BAG |
| Our Method (Combined) | This can be used as a warm up tool when you first are using the bag. |
| Reference | This obviously can also be used as a warm up tool when you’re first initially using a bag. |
| Continuous Sign | WHAT SHOUT |
| Fingerspelling | J I M D W D |
| Lip Reading (Phonemes) | z iy ah eh m jh k m d aw d w ih dh r o a r aw d r |
| Fused | I JIM DOWN WITH ROAR |
| Our Method (Combined) | So this is Jim Down with Roar zoo. |
| Reference | I’m Jim Dowd with Zoar Outdoor. |
| Modality / Component | Output / Transcription |
|---|---|
| Continuous Sign | OAR |
| Fingerspelling | R E I T S L |
| Lip Reading (Phonemes) | ah dh r ah m t ay l z w ay l |
| Fused | OTHER REPTILES |
| Our Method (Combined) | Other reptiles include… |
| Reference | Other reptiles as well. |
| Continuous Sign | GOOD BIKE BEACH |
| Fingerspelling | — |
| Lip Reading (Phonemes) | b r ah r b ah l hh y uw m n ah n z d uw l f ah n s f eh |
| Fused | TROUBLE HUMANS DOLPHINS |
| Our Method (Combined) | Humans often cause trouble for dolphins when out at sea |
| Reference | The trouble is, humans are not built like dolphins. |
| Continuous Sign | MADE HELLO APPLE RED |
| Fingerspelling | — |
| Lip Reading (Phonemes) | ih t s n uh t l ow k m w iy k z eh r ah b f aw t ih t |
| Fused | NOT LIKE CARE IT |
| Our Method (Combined) | It’s not like we care about it. |
| Reference | It’s not a crazy theory. |
| Continuous Sign | WE FAR TREE |
| Fingerspelling | — |
| Lip Reading (Phonemes) | n ao t b er s t ih g r iy d iy ah n t s z er v ih b ah l |
| Fused | NOT SURVIVE |
| Our Method (Combined) | We’re not likely to survive this. |
| Reference | Not the best ingredients for survival. |
| Continuous Sign | CATS BIG HOUSE OUT ME 5 GOOD |
| Fingerspelling | — |
| Lip Reading (Phonemes) | m ah n f r ch n t l f ae t s b s t w z ey iy p ah b iy t ah p eh r t ah l v ao t ey k ih t aw t ih m k w ey zh ah n |
| Fused | AN FORTUNATELY CATS BEST KEEP LITTLE PARROT ALIVE TAKE OUT |
| Our Method (Combined) | Unfortunately the cats are best kept outside to keep the little parrot alive. |
| Reference | Unfortunately for the cats, the best way to keep this chubby little parrot alive is to take kitty out of the equation. |
| Continuous Sign | I ROOM TEST CAMERA |
| Fingerspelling | — |
| Lip Reading (Phonemes) | m ay s ey aw v er r p ah ah z s m er n t |
| Fused | ME SAY RAPID ASSESSMENT |
| Our Method (Combined) | I was saying this is our rapid assessment area. |
| Reference | As I was saying, this is our rapid assessment bay. |
SignBind-LLM: Multi-Stage Modality Fusion for Sign Language Translation
Marshall Thomas, Edward Fish, Richard Bowden
CVSSP, University of Surrey
Guildford, Surrey, United Kingdom
{marshall.thomas, edward.fish, r.bowden}@surrey.ac.uk
Abstract
Despite progress in gloss-free Sign Language Translation (SLT), traditional single-modality end-to-end approaches consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in SLT with Large Language Models has sidestepped this challenge, forcing a single network to learn these simultaneously, which results in poor performance when translating crucial information such as names, places, and technical terms. We introduce SignBind-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state of the art on the How2Sign, ChicagoFSWildPlus, and BOBSL datasets, with a BLEU-4 score of 22.1, a letter accuracy of 73.2%, and a BLEU-4 score of 6.8, respectively. These results validate our core hypothesis: isolating and solving distinct recognition tasks before fusion provides a more powerful and effective pathway to robust, high-fidelity sign language translation.
1 Introduction
Sign languages are fully-fledged natural languages used by millions of Deaf people worldwide. They rely on manual gestures (handshapes, movements, and locations) and non-manual signals (facial expressions, mouthings, and mouth gestures). In sign language linguistics, mouthing denotes the silent articulation of spoken words that accompany signs, while mouth gestures are non-verbal movements integral to the sign itself, conveying grammatical or affective nuance. Languages like American Sign Language (ASL), British Sign Language (BSL), and German Sign Language (DGS) [25, 5, 19] each possess their own lexicon, grammar, fingerspelling method, and mouthings.
Early methods focused on Isolated Sign Language Recognition (ISLR) and typically required expertly annotated glosses and detailed English translations for individual signs [6]. While the advent of large, gloss-free datasets (e.g., YouTube-ASL [61], Public DGS [26], How2Sign [11]) and the integration of large language models (LLMs) [37] have moved the field toward end-to-end Sign Language Translation (SLT), two pervasive challenges continue to limit real-world performance:
Fingerspelling: Fingerspelling is a critical component of all sign languages and is often used for proper names, technical terms, and loanwords. This constitutes a significant portion of signing (e.g., 12–35% of ASL [62]). However, current gloss-free SLT systems [36, 39] treat it as a subordinate task within the main translation pipeline, while other SOTA methods ignore it entirely [35, 14]. As a result, fingerspelled terms are frequently mistranslated.
Non-Manual Cues: Mouthings and facial expressions carry crucial disambiguating information, akin to phonemes or visemes in speech [12, 43]. Yet, existing SLT models rarely utilize these powerful signals. The key challenge in integrating these features in end-to-end approaches is that they are not aligned with manual signs, particularly during fingerspelling.
In contrast to advances in Automatic Speech Recognition (ASR) [38] and Visual Speech Recognition (VSR) [59], which leverage massive pretraining and powerful architectures, sign language systems remain constrained. This discrepancy is primarily due to (a) data scarcity; (b) the spatio-temporal complexity of co-articulated gestures; and critically, (c) the unaddressed temporal offset between these manual and non-manual signals.
To overcome these fundamental limitations, we introduce SignBind-LLM, a novel gloss-free SLT framework built on a paradigm of multi-modal, parallel-stream processing. The novelty lies in its dedicated, three-pronged architecture designed to explicitly solve the issue of non-manual misalignment, temporal variance, fingerspelling recognition, and the integration of lipreading. We make the following contributions:
Dedicated Modality Streams: We implement three specialized processing streams: one for continuous sign, one dedicated sub-network for fine-grained fingerspelling detection, and one dedicated visual speech recognition (lipreading) component. All three are pre-trained independently for their respective tasks.
Asynchronicity-Aware Fusion Encoder: We introduce a lightweight transformer encoder equipped with a learnable gating mechanism that dynamically weighs the contribution of each visual stream before temporal integration. The Fusion Encoder adaptively balances information from the active modality (i.e., signing or fingerspelling) and the lipreading branch, enabling the model to emphasize the most informative cues under varying temporal asynchrony.
Contextual LLM Decoding: We train a specialised LLM to decode fused representations in sign gloss order into English sentences, generating contextually accurate and fluent translations without relying on any direct gloss supervision. We further provide LLM-generated pseudo-gloss annotations and phonemes for the 1.5M sentences in BOBSL [1] and 35k in How2Sign [11].
We demonstrate the efficacy of this new paradigm on How2Sign, ChicagoFSWildPlus, and BOBSL, with large improvements over SOTA end-to-end approaches ( B4 on How2Sign and B4 on BOBSL). Our findings demonstrate that combining modality-specific, parallel predictors with an LLM decoder is the scalable path forward for high-quality translation of real-world signed content.
2 Related Work
2.1 Sign Language Understanding
Sign Language Understanding (SLU) has evolved through three increasingly ambitious tasks:
Isolated Sign Language Recognition (ISLR): Early work focused on recognizing individual signs in controlled settings. Methods such as 3D CNNs over spatiotemporal volumes [41] and hybrid CNN–HMM pipelines [32] achieved high accuracy on small vocabularies (e.g., 100–500 signs), but relied on carefully segmented clips and extensive manual gloss annotations.
Continuous Sign Language Recognition (CSLR): To scale beyond isolated signs, CSLR methods adopted sequence‐level training with Connectionist Temporal Classification (CTC) loss [23]. Recurrent architectures (CNN–BiLSTM–CTC) [22] and, more recently, Transformer encoders [41] enabled end‐to‐end gloss prediction from unsegmented video. While these models tolerate variable sign durations, they still depend on gloss‐level transcripts, which are costly to annotate. Furthermore, CSLR methods often struggle with out‐of‐vocabulary signs and fingerspelling.
Sign Language Translation (SLT): SLT goes one step further by generating natural language sentences directly from sign video. SLT methods can largely be categorised into two groups. Gloss‐based SLT pipelines [68, 7, 67] utilise sign-aligned annotated data to predict gloss sequences, as in the CSLR case above. These can then be used as input to a sequence‐to‐sequence model [6] to improve translation quality. Examples include CNN–LSTM–attention frameworks that achieve fluent translation but inherit errors from the gloss recognizer. Alternatively, gloss‐free SLT [10, 71, 39, 21, 3] bypasses intermediate glosses, learning a direct video‐to‐text mapping. Gloss-free methods have the advantage that they can leverage large datasets without accurate and aligned gloss-level translations. This enables large‐scale video–text pretraining [70, 37, 71, 58] and multimodal contrastive objectives [24] to improve translation fluency. More recent approaches have focussed on leveraging the language modelling capabilities of LLMs to incorporate sign features from RGB [61], pose [14], or a fusion of both modalities [35, 63]. However, these models often underperform on fingerspelling and non‐manual cues, since they attempt to learn all visual subtasks in a single end‐to‐end network.
2.2 Lipreading and Visual Speech Recognition
Lipreading, or visual speech recognition (VSR), can complement sign language understanding by capturing silent speech cues. Modern systems employ deep learning-based architectures. LipNet [50] introduced an end-to-end model using spatiotemporal CNNs [33, 8] and BiLSTMs [22], achieving state-of-the-art sentence-level lipreading performance. Transformer-based models, such as Lipformer [64], further improved contextual understanding across video frames. Recent advances leverage self-supervised learning and multimodal fusion to enhance lipreading robustness. AV-HuBERT [52] demonstrated the efficacy of large-scale audiovisual pretraining, achieving strong performance under challenging conditions [59]. However, lipreading remains underexplored in the context of fingerspelling and full‐sentence SLT, despite mouthing cues carrying up to 40% of linguistic information in natural sign production (e.g., lexical disambiguation, grammatical markers) [2].
2.3 Multimodal Approaches
Multimodal learning has emerged as a promising approach to improve SLT by integrating visual, linguistic, and contextual information [29]. Previous systems fused RGB video, pose sequences, and optical flow [14] to enhance recognition accuracy. Recent works such as GFSLT-VLP [70] incorporated video-text contrastive learning and cross-modal attention mechanisms to bridge the gap between sign language and natural language representations.
However, current fingerspelling detection methods remain focused on hand-centric cues and fail to leverage the rich information present in lip movements. Approaches such as synthetic data augmentation [15], large “in‐the‐wild” collections [69, 20], and pose‐based models achieve moderate accuracy on isolated letter streams but do not integrate mouthings. This design overlooks the well‐documented co‐articulation between handshapes and lip movements in natural signing [2], leading to persistent errors on homonymous handshapes (e.g., “M” vs. “N”) and proper‐noun detection. To date, no SLT system has explicitly fused high‐resolution fingerspelling and lipreading streams while accounting for their temporal misalignment.
2.4 LLMs in SLT and VSR
The rise of Large Language Models (LLMs) has revolutionized sign language understanding, fingerspelling detection, and lipreading by leveraging vast linguistic knowledge and multimodal learning capabilities. In SLT, LLMs have been employed to bridge the gap between sign video sequences and natural language text [14, 35]. Newer approaches [48] applied transformer-based architectures to generate coherent text from sign videos, outperforming traditional gloss-based approaches. For fingerspelling detection, LLMs have facilitated cross-modal learning by aligning hand gestures with corresponding textual representations [57, 51]. These approaches demonstrated that LLMs can effectively enhance fingerspelling recognition by integrating linguistic and visual modalities but they have not previously been combined with lipreading.
In lipreading, LLMs have been integrated into audiovisual speech recognition pipelines to improve contextual understanding and transcription accuracy [59]. The first approach to utilize LLM-based encoders [65] enhanced lipreading robustness, even under challenging conditions.
Despite progress in each of these individual modality streams and the recent availability of large sign language datasets and language models, no one has yet proposed a method that combines lipreading, fingerspelling, and sign recognition in a unified framework for SLT.
3 Method
We propose SignBind-LLM, a novel sign language translation framework that decomposes the complex translation task into specialized sub-tasks before fusion. Our key insight is that sign languages employ distinct communication channels—continuous signing, fingerspelling, and mouthings—that benefit from dedicated modeling before integration. We achieve this through a four-stage pipeline: (1) Target Generation via Text Pre-processing, (2) Modality-Specific Pre-training, (3) Multi-modal Fusion, and (4) Language Model Refinement.
Figure 2 illustrates our complete architecture.
3.1 Problem Formulation
Given an input sign language video, represented as a sequence of frames, our primary objective is to produce an English translation. We formally decompose this mapping from video to text into a series of specialized functions:
[TABLE]
Here, the three expert functions handle continuous signing, fingerspelling, and lipreading, respectively. Their outputs are integrated into a unified pseudo-gloss representation, which is finally translated into the target sentence.
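The decomposition can be written compactly as follows. The symbols are assumed notation introduced here for illustration, not recovered from the source: $V$ is the input video of $T$ frames, $f_{\text{sign}}$, $f_{\text{fs}}$, and $f_{\text{lip}}$ are the three experts, $f_{\text{fuse}}$ is the fusion module producing the pseudo-gloss sequence $G$, and $f_{\text{LLM}}$ is the language model producing the sentence $Y$.

```latex
\begin{align}
V &= (v_1, \dots, v_T), \\
G &= f_{\text{fuse}}\big(f_{\text{sign}}(V),\; f_{\text{fs}}(V),\; f_{\text{lip}}(V)\big), \\
Y &= f_{\text{LLM}}(G).
\end{align}
```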
3.2 Stage 1: Target Generation via Text Pre-processing
A significant barrier to multi-stream modeling is the lack of parallel, granular annotations. We overcome this by automatically generating three complementary training targets directly from the English subtitles.
Pseudo-Gloss Generation: To create a compact and linguistically relevant target for the sign recognition task, we employ GPT-4o [47] to transform English sentences into ASL-ordered pseudo-glosses. This process removes function words and reorders content to match common signing patterns.
[TABLE]
This transformation reflects the removal of semantically redundant elements such as articles, and the re-ordering to a subject-object-verb structure.
Phoneme Extraction: Mouthings provide critical disambiguation for visually similar signs. To supervise our lipreading module, we extract phoneme sequences from the generated pseudo-glosses:
[TABLE]
Each element of the sequence represents an English phoneme, enabling our model to learn the fine-grained articulatory motions of mouthed words.
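A minimal sketch of this step, assuming a pronunciation lexicon is available as a plain dictionary. The small `LEXICON` below is a hypothetical stand-in whose entries are copied from the pseudo-gloss and phoneme examples in this paper; the source does not state which grapheme-to-phoneme tool was used, and `gloss_to_phonemes` is an illustrative name.

```python
# Hypothetical mini-lexicon; entries copied from the paper's example table.
LEXICON = {
    "HAVE": ["hh", "ae", "v"],
    "DIFFERENT": ["d", "ih", "f", "er", "ah", "n", "t"],
    "HERE": ["hh", "iy", "r"],
    "GO": ["g", "ow"],
    "MEASURE": ["m", "eh", "zh", "er"],
    "IT": ["ih", "t"],
}


def gloss_to_phonemes(pseudo_gloss, lexicon=LEXICON):
    """Map a pseudo-gloss string to a flat phoneme sequence.

    Words missing from the lexicon are skipped here; a real system would
    fall back to a grapheme-to-phoneme model instead.
    """
    phonemes = []
    for word in pseudo_gloss.split():
        phonemes.extend(lexicon.get(word.upper(), []))
    return phonemes


print(" ".join(gloss_to_phonemes("HAVE DIFFERENT HERE")))
print(" ".join(gloss_to_phonemes("GO MEASURE IT")))
```

The flat output sequence carries no word boundaries, which matches the phoneme targets shown in the example table.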
Fingerspelling Identification: For the fingerspelling labels we use the ChicagoFSWild+ dataset and split each word label into its individual character sequence. Each element represents a character in the word label, enabling the model to learn each letter in the alphabet. We also isolate sequences that contain potential fingerspelling from the pseudo-glosses by applying rule-based detection to them. We do this by identifying proper nouns and technical terms, which are typically fingerspelled.
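The label-splitting step can be sketched in a few lines. This is an illustration only: `label_to_characters` is a hypothetical helper, and dropping whitespace is inferred from the paper's examples, where "political capital" becomes a single run of letters.

```python
def label_to_characters(label):
    """Turn a fingerspelled word label into an upper-case character sequence.

    Whitespace and other non-alphabetic symbols are dropped, which matches
    the paper's examples where multi-word labels lose their spaces.
    """
    return [ch.upper() for ch in label if ch.isalpha()]


print(" ".join(label_to_characters("political capital")))
print(" ".join(label_to_characters("laurene simms")))
```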
3.3 Stage 2: Modality-Specific Encoders
This stage defines the “expert” networks that learn to recognize each specific communication channel.
Video Pre-processing: We process the input video into two distinct streams to feed the appropriate experts. We use MediaPipe [40] to extract both the full frames (for manual signs) and tightly-cropped face frames (for lipreading).
[TABLE]
Both streams are normalized to a common resolution.
Shared Backbone and Manual Experts: The core of our visual encoder is a single DINOv2 backbone [42] that processes the full-frame stream. The features extracted from this backbone are shared, serving as the input for three independent parallel heads.
Dynamic Sequence Routing: A key innovation of our work is a dynamic routing module that operates on the shared backbone features. We find that applying all experts to all frames is inefficient and introduces noise, so we train a lightweight classifier to route segments to the correct expert.
First, we apply temporal-average pooling to the DINOv2 features to get a sequence-level representation. A small MLP head then processes this feature to produce class logits.
[TABLE]
To maintain differentiability for training, we use the Gumbel-Softmax [28] reparameterization trick on the raw logits to obtain a one-hot-like selection vector:
[TABLE]
The relaxation is controlled by a temperature parameter. This routing vector is passed to the fusion stage (Stage 3) to dynamically weight the outputs of the expert heads.
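The routing step can be sketched with the standard library alone. Everything named here is an assumption for illustration: a single dense layer stands in for the small MLP head, the three classes are taken to be sign, fingerspelling, and rest, and the weights are arbitrary. The Gumbel-Softmax itself follows the usual definition, a softmax over logits plus Gumbel noise divided by the temperature.

```python
import math
import random


def temporal_average_pool(features):
    """Average a (T x D) list of frame features over time -> length-D vector."""
    num_frames = len(features)
    return [sum(column) / num_frames for column in zip(*features)]


def linear(x, weights, bias):
    """Dense layer: weights is (out x in), bias has length out."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / temperature)
    top = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
features = [[0.2, 1.0], [0.4, 3.0], [0.6, 2.0]]    # T=3 frames, D=2 dims
pooled = temporal_average_pool(features)            # roughly [0.4, 2.0]
weights = [[1.0, 0.5], [-1.0, 0.2], [0.0, -0.3]]    # 3 assumed classes x 2 dims
logits = linear(pooled, weights, [0.0, 0.0, 0.0])
route = gumbel_softmax(logits, temperature=0.5, rng=rng)
print(route)  # three non-negative weights that sum to 1
```

Lower temperatures push the output closer to a hard one-hot choice, while higher temperatures spread weight across the experts.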
Continuous Sign Recognition: In parallel with the router, the continuous signing expert processes the full, unpooled DINOv2 features. We add positional embeddings and pass the features through a dedicated linear head that projects to the size of the sign vocabulary:
[TABLE]
Fingerspelling Detection: Similarly, the fingerspelling expert processes the same position-encoded features but uses its own dedicated linear head to produce logits over the 26 characters of the alphabet:
[TABLE]
Lipreading Module: The lipreading expert processes the face-cropped stream. We employ a masked ViT with a 50% masking ratio to force the model to learn robust representations. A 1D convolution then performs temporal adaptation, and a final linear layer projects to the phoneme vocabulary:
[TABLE]
Note that the temporal dimension of the lipreading output is shorter than that of the input due to the convolutional pooling.
3.4 Stage 3: Temporal-Aware Multi-modal Fusion
This stage addresses the core challenge of fusing our asynchronous experts. First, we project all outputs to a common dimension and align their temporal lengths. The lipreading logits are upsampled from their reduced temporal length to the length of the manual streams:
[TABLE]
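As an illustration of the temporal alignment, the sketch below stretches a shorter lipreading sequence to the length of the manual stream. Nearest-neighbour copying is an assumption chosen for simplicity; the source does not state which interpolation scheme is used, and `upsample_nearest` is a hypothetical helper.

```python
def upsample_nearest(sequence, target_length):
    """Stretch a sequence of per-step vectors to target_length steps.

    Output step t copies the input step at floor(t * len(sequence) / target_length).
    """
    source_length = len(sequence)
    if source_length == 0 or target_length <= 0:
        raise ValueError("need a non-empty sequence and a positive target length")
    return [sequence[(t * source_length) // target_length] for t in range(target_length)]


lip_steps = [[0.1, 0.9], [0.8, 0.2]]        # 2 lipreading steps
aligned = upsample_nearest(lip_steps, 4)    # 4 manual-stream steps
print(aligned)
```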
Gated Manual Feature Aggregation: We use the routing vector from Stage 2 to create a single, unified manual representation. This dynamically selects the correct expert output (sign or fingerspelling) for each segment.
[TABLE]
The expert outputs are combined with the routing vector by element-wise multiplication, and a learned null vector represents rest periods.
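One plausible reading of this aggregation is a weighted sum of the two manual expert outputs and the null vector, with weights taken from the routing vector. The sketch below assumes a three-way routing vector ordered as (sign, fingerspelling, rest) and broadcasts each weight over all time steps; both choices, and the name `aggregate_manual`, are assumptions rather than details confirmed by the source.

```python
def aggregate_manual(route, sign_feats, fs_feats, null_vec):
    """Mix per-step sign and fingerspelling features using routing weights.

    route = (w_sign, w_fs, w_null); each output step is
    w_sign * sign + w_fs * fingerspelling + w_null * null_vec.
    """
    w_sign, w_fs, w_null = route
    return [
        [w_sign * s + w_fs * f + w_null * n for s, f, n in zip(sign_t, fs_t, null_vec)]
        for sign_t, fs_t in zip(sign_feats, fs_feats)
    ]


sign_feats = [[1.0, 2.0], [3.0, 4.0]]      # T=2 steps, D=2 dims
fs_feats = [[10.0, 20.0], [30.0, 40.0]]
null_vec = [-1.0, -1.0]                    # stands in for the learned null vector

print(aggregate_manual((1.0, 0.0, 0.0), sign_feats, fs_feats, null_vec))  # sign only
print(aggregate_manual((0.0, 0.0, 1.0), sign_feats, fs_feats, null_vec))  # rest period
```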
Adaptive Feature Gating: The relative importance of manual signs versus mouthings is context-dependent. We learn this balance with an adaptive gating mechanism. A learned gate controls the information flow from the manual stream and the lip stream:
[TABLE]
The gate computation uses feature concatenation of the two streams.
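A common form of such a gate, shown here as an assumption rather than the paper's exact equation, computes a sigmoid over a linear map of the concatenated streams and uses it to blend them per feature. The names `gated_fusion`, `gate_weights`, and `gate_bias` are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(manual_t, lip_t, gate_weights, gate_bias):
    """Blend one time step of manual and lip features with a learned gate.

    gate = sigmoid(W @ [manual_t ; lip_t] + b), computed per feature, and the
    output is gate * manual_t + (1 - gate) * lip_t.
    """
    concat = list(manual_t) + list(lip_t)
    fused = []
    for row, bias, m, l in zip(gate_weights, gate_bias, manual_t, lip_t):
        gate = sigmoid(sum(w * v for w, v in zip(row, concat)) + bias)
        fused.append(gate * m + (1.0 - gate) * l)
    return fused


manual_t = [1.0, 0.0]
lip_t = [0.0, 1.0]
zero_w = [[0.0] * 4, [0.0] * 4]   # 2 gates, each over the 4-dim concatenation

print(gated_fusion(manual_t, lip_t, zero_w, [0.0, 0.0]))    # gate = 0.5: equal blend
print(gated_fusion(manual_t, lip_t, zero_w, [10.0, 10.0]))  # gate near 1: mostly manual
```

With a gate near 1 the fused feature follows the manual stream, and with a gate near 0 it follows the lip stream, which is the context-dependent balance the paragraph describes.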
Transformer-based Temporal Modelling: The resulting fused features are passed through a final TransformerEncoder to model long-range temporal dependencies, producing the final contextual representation:
[TABLE]
This fusion module is trained to produce the correct pseudo-gloss sequence via a final CTC loss:
[TABLE]
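To show what a CTC-trained head produces at inference time, the sketch below applies standard greedy CTC decoding: take the best token per frame, merge consecutive repeats, then drop blanks. It illustrates decoding, not the loss itself. The tiny vocabulary and the blank index of 0 are assumptions for the example.

```python
def ctc_greedy_decode(frame_logits, vocabulary, blank_index=0):
    """Pick the arg-max token per frame, merge repeats, then drop blanks."""
    decoded = []
    previous = None
    for logits in frame_logits:
        best = max(range(len(logits)), key=logits.__getitem__)
        if best != previous and best != blank_index:
            decoded.append(vocabulary[best])
        previous = best
    return decoded


vocabulary = ["<blank>", "SOME", "TIGHT", "NOT"]   # hypothetical gloss vocabulary
path = [1, 1, 0, 2, 2, 0, 1, 3, 0, 2]              # per-frame winning indices
frame_logits = [[1.0 if i == k else 0.0 for i in range(4)] for k in path]

print(" ".join(ctc_greedy_decode(frame_logits, vocabulary)))
```

The blank token is what lets the same gloss appear twice in a row: a repeated gloss survives merging only if a blank frame separates its two occurrences.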
3.5 Stage 4: Language Model Refinement
The fused representation is a sequence of pseudo-glosses, which is not representative of spoken English. The final stage employs a language model to translate this intermediate representation into the final sentence. This model is trained independently of the visual pipeline. We pre-train it on a large corpus of (pseudo-gloss, English sentence) pairs generated in Stage 1. This decoupled approach allows the LLM to learn a robust linguistic mapping that generalizes well to the noisy gloss outputs produced by the visual model at inference time.
3.6 Training Strategy
Our curriculum is strictly staged to ensure stable convergence and component specialization. We do not use a single, joint end-to-end loss. Instead, each module is trained independently, with its parameters frozen before being used by the next stage. This process is applied in two phases: pre-training and fine-tuning.
Phase 1: Stage-wise Pre-training: We first pre-train each component on large-scale, diverse datasets (e.g., YouTube-ASL [61], ChicagoFSWild+).
1. Train Experts: The Sequence Classifier, Sign Encoder, FS Encoder, and Lipreading module are all trained separately on their respective targets from Stage 1. The losses for this stage are defined as:
[TABLE]
where each loss takes the output logits from the corresponding branch and the corresponding ground truth token sequence.
2. Freeze Experts & Train Fusion: After the experts are trained, their weights are frozen. The Fusion Module is then trained on the frozen outputs of the experts. The loss for the fusion is defined as:
[TABLE]
3. Train LLM: Separately, the Language Model is pre-trained on all available text pairs using standard Cross Entropy Loss.
Phase 2: Stage-wise Fine-tuning: We repeat the same staged process on the smaller, high-quality fine-tuning dataset (e.g., How2Sign [11]).
1. Fine-tune Experts: The pre-trained experts are fine-tuned on the How2Sign data, again using their independent losses.
2. Freeze Experts & Fine-tune Fusion: The fine-tuned expert weights are frozen. The pre-trained Fusion Module is then fine-tuned on their outputs.
3. LLM: The LLM is already fully trained and remains frozen during this phase.
This decoupled, staged methodology ensures that each component becomes a robust expert at its specific task before its outputs are used to train subsequent modules.
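The curriculum can be summarised as a small schedule in which each stage updates exactly one group of modules while earlier groups stay frozen. The sketch below is only a schematic: the module labels, the `Stage` structure and the helper `is_strictly_staged` are hypothetical names introduced for illustration, not part of the released implementation.

```python
from dataclasses import dataclass

# Hypothetical labels for the four expert modules.
EXPERTS = ("sequence_classifier", "sign_encoder", "fs_encoder", "lipreading")

@dataclass
class Stage:
    name: str
    data: str
    trainable: tuple  # modules whose weights are updated in this stage
    frozen: tuple = ()  # modules used with fixed weights in this stage

def build_curriculum(pretrain_data, finetune_data):
    """Phase 1 (three pre-training stages) followed by Phase 2 (two fine-tuning stages)."""
    return [
        Stage("pretrain experts", pretrain_data, EXPERTS),
        Stage("pretrain fusion", pretrain_data, ("fusion",), EXPERTS),
        Stage("pretrain LLM", "pseudo-gloss/English text pairs", ("llm",)),
        Stage("finetune experts", finetune_data, EXPERTS, ("llm",)),
        Stage("finetune fusion", finetune_data, ("fusion",), EXPERTS + ("llm",)),
    ]

def module_group(name):
    return "experts" if name in EXPERTS else name

def is_strictly_staged(curriculum):
    """True if every stage trains a single module group and never a frozen module."""
    for stage in curriculum:
        groups = {module_group(m) for m in stage.trainable}
        if len(groups) != 1:
            return False
        if set(stage.trainable) & set(stage.frozen):
            return False
    return True

curriculum = build_curriculum("YouTube-ASL + ChicagoFSWild+", "How2Sign")
```

Representing the schedule explicitly makes it easy to check that no stage optimises the experts, the fusion module and the LLM jointly.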
4 Experiments
4.1 Datasets
For ASL, we pre-trained our model on the YouTube-ASL and ChicagoFSWild+ datasets, then fine-tuned it on How2Sign. We then evaluated our proposed framework on How2Sign and ChicagoFSWild+. For BSL, we trained and evaluated our model on BOBSL [1].
YouTube-ASL [61]: A large-scale ASL dataset comprising 60K videos and 1,000 hours sourced from YouTube, annotated with sentence-level transcriptions. This dataset provides diverse signing conditions, including variations in background, lighting, and signer appearance. The dataset features over 2,500 unique signers, ensuring that the model generalizes well across different users.
ChicagoFSWildPlus [4]: A dataset of 55,232 in-the-wild ASL fingerspelling sequences performed by 260 signers. It includes varied environments and occlusions, which makes it well suited to evaluating the robustness of fingerspelling detection.
How2Sign [11]: How2Sign is a large‐scale, continuous ASL corpus derived from 2,456 instructional “How2” videos, totaling over 80 hours of footage. Recordings were made in two settings: a Green-Screen studio (79.1 h over 2,529 videos) and the Panoptic studio (2.96 h over 124 videos). Eleven signers (5 hearing ASL interpreters, 2 hard-of-hearing, 4 Deaf) produced 35,191 sentence‐level clips (average 162 frames/5.4 s, 17 words), yielding a vocabulary of over 16,000 English words.
BOBSL [1]: A large-scale video dataset of British Sign Language drawn from BSL-interpreted BBC broadcast footage. The data features 39 unique interpreters from 1,962 episodes across 426 TV shows, resulting in approximately 1,467 hours of video content. The videos are annotated with English subtitles, totalling approximately 1.2M sentences.
4.2 Evaluation Metrics
We evaluated model performance using several common metrics: Letter Accuracy [18], BLEU Score [46] and ROUGE [17]. For all metrics, higher scores indicate better performance.
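As a concrete illustration of the fingerspelling metric, the sketch below computes letter accuracy as 1 − (S + D + I) / N, where S, D and I are the substitutions, deletions and insertions in the minimum edit distance between hypothesis and reference, and N is the reference length. This is the definition commonly used in fingerspelling recognition work; that it matches the exact evaluation script used here is an assumption, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance with unit cost for substitution, deletion and insertion."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def letter_accuracy(ref, hyp):
    """1 - edit_distance / len(ref); can be negative for very long hypotheses."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)

# One dropped letter out of six reference letters.
score = letter_accuracy("surrey", "surey")
```

Because insertions are counted as errors, the score is not bounded below by zero, which is worth keeping in mind when averaging over sequences.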
5 Results
In this section we compare the performance of SignBind-LLM with several state-of-the-art models from recent literature. Table 1 and Table 2 show the comparison between our approach and other approaches in SLT on How2Sign and BOBSL respectively. Table 3 shows the comparison for fingerspelling on the ChicagoFSWildPlus dataset.
5.1 Qualitative Results
Figure 2 shows three example translations, with the outputs from each stream. It also shows where the translation process fails and how the fusion transformer helps to remedy this. Finally, the figure shows how the LLM generates the spoken-English sentence from the fusion predictions. We observe that the mouthed features provide a strong signal for the fusion encoder, while the LLM is effective at correcting grammatical errors in the translation.
5.2 Ablation Study
In this section, we discuss the different ablation studies performed to demonstrate the contribution of the various components in the architecture.
Model Variants: The first ablation was to understand the contribution of each component and the importance of each modality. We conducted an ablation study by selectively disabling key features of SignBind-LLM and comparing the results. Table 4 presents the results. We identify that the fusion network and LLM are key for improving performance, while lipreading is the most important modality.
Stand-alone Effectiveness: This ablation focuses on the effectiveness of the two primary experts, the continuous sign predictor and the Lipreading predictor. Table 5 shows the phoneme prediction performance and continuous sign prediction performance of the model on the How2Sign dataset from the sign CTC branch directly. We observe that lipreading is a critical component of the network delivering most of the performance improvement during fusion. The low continuous sign score can also be attributed to the misalignment between this branch and the English translation before the LLM reordering.
Varying LLMs: The next ablation was to compare the effectiveness of different LLMs, fully fine-tuned with different parameter sizes. Table 6 shows the results of this study.
Asynchronous Fusion: The penultimate ablation focuses on the effectiveness of the fusion model. The aim is to quantify how sensitive our fusion mechanism is to the temporal misalignment that naturally occurs between hand movements (manual signals) and mouth movements (non‐manual cues) in real sign language. To quantify this we test with four different temporal shifts:
1. No shift: Directly fuse frame t of lips with frame t of hands, with no temporal shift at all.
2. 5 frames: fuse frame t of hands with lip frames
3. 10 frames: fuse frame t of hands with lip frames
4. Learned Alignment: The model learns an optimal per-time fusion gating.
6 Conclusion
We introduced SignBind-LLM, a modular framework that redefines gloss-free Sign Language Translation through explicit multi-stage fusion of continuous signing, fingerspelling, and lipreading. By decomposing SLT into dedicated expert streams and resolving their temporal asynchrony via a lightweight transformer, our model achieves state-of-the-art results on How2Sign (BLEU-4 of 22.1), BOBSL (BLEU-4 of 6.8) and ChicagoFSWildPlus (73.2% letter accuracy). Our findings validate that isolating and reconciling heterogeneous visual-linguistic cues before fusion leads to SOTA performance on sign language translation.
7 Acknowledgements
This work was supported by the SNSF project ‘SMILE II’ (CRSII5 193686), the Innosuisse IICT Flagship (PFFS-21 47), EPSRC grant APP24554 (SignGPT-EP/Z535370/1) and through funding from Google.org via the AI for Global Goals scheme. This work reflects only the author’s views and the funders are not responsible for any use that may be made of the information it contains. Thank you to Oline Ranum for help with the Part-of-Speech analysis.
Appendix A Introduction
This supplementary material provides comprehensive technical details and additional ablation experiments for our proposed method.
The document is organized as follows:
- Appendix B – Extended Ablation Studies: Detailed comparisons of fusion architectures and zero-shot generalization experiments, quantifying the benefits of large-scale pre-training.
- Appendix C – Part-of-Speech Analysis: Fine-grained linguistic analysis across 16 POS categories, revealing our model’s strengths in content word prediction and the trade-off between visual fidelity and grammatical fluency.
- Appendix D – Implementation Details: Complete experimental setup including pseudo-glossing pipeline, phoneme extraction, model architecture specifications, training hyperparameters, and computational requirements.
- Appendix E – Qualitative Translation Analysis: Extensive translation examples from How2Sign and BOBSL, showing outputs from each expert stream and demonstrating how the Fusion Encoder resolves ambiguities.
Appendix B Extended Ablation Studies
In the main paper, we demonstrated that our Gated Fusion mechanism achieves state-of-the-art performance. Here, we analyze alternative fusion strategies and the model’s zero-shot generalization capabilities.
B.1 Analysis of Fusion Strategies
As discussed in the main paper, sign language translation faces a unique challenge: temporal asynchrony. The manual sign for a concept often occurs slightly before or after the corresponding mouthing, a phenomenon well documented in sign language linguistics but rarely addressed in computational models. We hypothesized that a simple concatenation of features would fail to capture this dynamic relationship, and that explicit gating mechanisms would be necessary to learn when to rely on each modality.
To validate this hypothesis, we compared three distinct fusion strategies:
1. Concatenation + MLP: A naive baseline where visual features from all streams (manual + lip) are concatenated at each timestep and projected back to a common dimension via a two-layer MLP with GELU activation and dropout (). This MLP serves as a learned mixing function without any explicit attention or content-adaptive weighting. The resulting fused features are then fed directly into the Fusion Encoder.
2. Cross-Attention Fusion: A standard Transformer-based approach where the manual stream features act as queries and attend over the lipreading features as keys and values. The output of the cross-attention block is added to the manual representation via a residual connection and layer normalization. This allows full bidirectional interaction between modalities but comes at significant computational cost.
3. Gated Fusion (Ours): Our proposed mechanism that dynamically weighs the importance of the lipreading stream based on a learned gating function applied to the manual stream features. This lightweight approach (single linear layer + sigmoid) explicitly models the confidence of the manual predictor and adaptively suppresses or emphasizes lip information accordingly.
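To make the cross-attention baseline concrete, the following single-head NumPy sketch follows the description above: manual features form the queries, lip features form the keys and values, and the attended output is added back to the manual stream before layer normalization. The single head, the absence of learned normalization parameters, and the names `cross_attention_fusion`, `W_q`, `W_k` and `W_v` are simplifying assumptions for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each frame's features to zero mean and unit variance (no learned scale)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(h_manual, h_lip, W_q, W_k, W_v):
    """Manual frames (queries) attend over lip frames (keys/values)."""
    q = h_manual @ W_q                        # (T_manual, d)
    k = h_lip @ W_k                           # (T_lip, d)
    v = h_lip @ W_v                           # (T_lip, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_manual, T_lip)
    attn = softmax(scores, axis=-1)
    fused = layer_norm(h_manual + attn @ v)   # residual connection + layer norm
    return fused, attn

# Toy inputs with different stream lengths (hypothetical sizes).
rng = np.random.default_rng(0)
T_manual, T_lip, d = 5, 7, 8
h_manual = rng.normal(size=(T_manual, d))
h_lip = rng.normal(size=(T_lip, d))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

fused, attn = cross_attention_fusion(h_manual, h_lip, W_q, W_k, W_v)
```

The `scores` matrix has one entry per (manual frame, lip frame) pair, so its size grows with the product of the two sequence lengths, whereas a per-frame gate needs only one linear projection per timestep.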
As shown in Table 8, the Concatenation baseline performs poorly (12.4 BLEU-4), a 9.7 point drop from our full model. We attribute this to “noise injection”: without a gating mechanism, the model cannot suppress the lipreading stream during periods of silence or irrelevant mouth movements (such as natural facial expressions unrelated to linguistic content), leading to hallucinations and semantic drift.
Cross-Attention improves substantially over concatenation and provides a negligible improvement over the gated fusion method, demonstrating that bidirectional interaction between modalities is beneficial. However, this approach introduces significant computational overhead. Cross-attention requires operations per layer, whereas our gating mechanism requires only operations.
Our Gated Fusion achieves comparable performance (22.1 BLEU-4) by explicitly learning when to rely on lip patterns (e.g., during fingerspelling sequences or when manual signs are ambiguous) and when to ignore them (e.g., during non-linguistic facial expressions or signer speech).
B.2 Zero-Shot Generalization and Pre-training Effects
We further investigated the transferability of our learned representations by evaluating zero-shot generalization from the large-scale YouTube-ASL dataset to the smaller, controlled How2Sign dataset. As shown in Section C.1, when trained solely on YouTube-ASL (1,000 hours, diverse signers and conditions) and evaluated on How2Sign without any fine-tuning, the model achieves a BLEU-4 of . While substantially lower than the supervised baseline, this is a non-trivial result for a zero-shot gloss-free system. For comparison, in the original YT-ASL paper [61] the authors report a B4 score of just .
Notably, training only on How2Sign (without YouTube-ASL pre-training) yields 13.7 BLEU-4, which is significantly worse than our two-stage approach (22.1). This demonstrates that the “in-the-wild” diversity of YouTube-ASL teaches the model robust, signer-independent features for phonemes, handshapes, and their temporal relationships. The controlled How2Sign environment, while higher quality, lacks the variability necessary for the model to learn truly generalizable representations. Even so, the How2Sign-only result is still significantly better than the other approaches shown in Section C.1.
Appendix C Part-of-Speech (POS) Analysis
A common issue with Sign Language Translation methods is when the model predicts correct content words (nouns, verbs) but fails to construct a grammatically valid sentence with appropriate function words (prepositions, determiners, auxiliary verbs). This failure is particularly prevalent in gloss-based approaches, since intermediate gloss representations typically omit function words entirely. To evaluate whether our model exhibits this same behaviour we conducted a comprehensive Part-of-Speech analysis.
C.1 Methodology
We ran Part-of-Speech tagging using the spaCy English language model (en_core_web_sm) on both the ground truth How2Sign references and our model’s generated translations. For each sentence, we extracted the distribution of POS tags and computed the accuracy for each tag compared with two SOTA approaches for SLT, Geo-Sign [14] and C2RL [9].
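The sketch below shows one way such a per-tag score could be computed once both sentences have been tagged. It works on pre-tagged `(token, tag)` pairs so that it runs without spaCy installed; the multiset-recall definition of per-tag accuracy and the function name `per_tag_recall` are assumptions, since the exact scoring rule is not spelled out here.

```python
from collections import Counter, defaultdict

# With spaCy installed, tagged input could be produced as follows (not run here):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   tagged = [(t.text.lower(), t.pos_) for t in nlp(sentence)]

def per_tag_recall(ref_tagged, hyp_tagged):
    """For each POS tag, the fraction of reference tokens carrying that tag
    that also occur in the hypothesis with the same tag (multiset matching).

    This scoring rule is an assumption made for illustration.
    """
    ref_counts = Counter(ref_tagged)
    hyp_counts = Counter(hyp_tagged)
    total = defaultdict(int)
    matched = defaultdict(int)
    for (token, tag), n in ref_counts.items():
        total[tag] += n
        matched[tag] += min(n, hyp_counts[(token, tag)])
    return {tag: matched[tag] / total[tag] for tag in total}

# Hypothetical example: content words are recovered, determiners are not.
ref = [("the", "DET"), ("chef", "NOUN"), ("cuts", "VERB"),
       ("the", "DET"), ("onion", "NOUN")]
hyp = [("chef", "NOUN"), ("cuts", "VERB"), ("an", "DET"), ("onion", "NOUN")]
scores = per_tag_recall(ref, hyp)
```

In this toy case the nouns and verb are fully matched while both reference determiners are missed, which is the kind of content-word versus function-word gap the analysis is designed to expose.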
References
- [1] Samuel Albanie, Gül Varol, Liliane Momeni, Hannah Bull, Triantafyllos Afouras, Himel Chowdhury, Neil Fox, Bencie Woll, Rob Cooper, Andrew Mc Parland, and Andrew Zisserman. BOBSL: BBC-Oxford British Sign Language Dataset. 2021.
- [2] Mario Aparicio, Philippe Peigneux, Brigitte Charlier, Danielle Balériaux, Martin Kavec, and Jacqueline Leybaert. The neural basis of speech perception through lipreading and manual cues: Evidence from deaf native users of cued speech. Neuropsychologia, 2017.
- [3] Sobhan Asasi, Mohamed Ilyes Lakhal, and Richard Bowden. Hierarchical feature alignment for gloss-free sign language translation. In Adjunct Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents, 2025.
- [4] B. Shi, A. Martinez Del Rio, J. Keane, D. Brentari, G. Shakhnarovich, and K. Livescu. Fingerspelling recognition in the wild with iterative visual attention. ICCV, 2019.
- [5] British Sign Language. British sign language resources. https://www.british-sign.co.uk/, 2024.
- [6] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. Neural sign language translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [7] Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard Bowden. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [8] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
