SignBind-LLM: Multi-Stage Modality Fusion for Sign Language Translation

Marshall Thomas; Edward Fish; Richard Bowden

arXiv:2509.00030·cs.CL·December 5, 2025


TL;DR

SignBind-LLM introduces a modular multi-stage approach for sign language translation, employing specialized predictors for different modalities and a transformer-based fusion to improve accuracy and handle asynchronous cues, setting new state-of-the-art results.

Contribution

The paper presents a novel modular framework that separately decodes multiple sign language modalities and fuses them with a transformer before language modeling, improving translation quality.

Findings

01

Achieved new state-of-the-art BLEU-4 scores on multiple datasets.

02

Demonstrated improved letter accuracy and translation fidelity.

03

Validated the effectiveness of modality-specific predictors and fusion strategy.

Abstract

Despite progress in gloss-free Sign Language Translation (SLT), traditional single-modality end-to-end approaches consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in SLT with Large Language Models has sidestepped this challenge, forcing a single network to learn these simultaneously, resulting in poor performance when tasked with translating crucial information such as names, places, and technical terms. We introduce SignBind-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight…

Tables (18)

Table 1. Modality (pose / RGB) and BLEU / ROUGE comparison on How2Sign. SignBind-LLM achieves the best scores across all metrics.

Model	Modality: Pose	Modality: RGB	BLEU-1 $↑$	BLEU-4 $↑$	ROUGE $↑$
GloFE-VN [37]	✓		14.9	2.2	12.6
MSLU [71]	✓	✓	20.1	2.4	17.2
SLT-IV [58]		✓	34.0	8.0	-
C²RL [9]		✓	29.1	9.4	27.0
FLa-LLM [10]		✓	29.8	9.7	-
YouTube-ASL [61]	✓		37.8	12.4	-
SignMusketeers [24]		✓	41.5	14.3	-
Uni-Sign [35]	✓	✓	40.2	14.9	36.0
Geo-Sign [14]	✓		40.8	15.1	35.4
SSVP-SLT [49]		✓	43.2	15.5	38.4
SignBind-LLM (Ours)		✓	49.4	22.1	41.2

Table 2. Comparison on BOBSL in BLEU-1, BLEU-4, and ROUGE-L. SignBind-LLM performed best across all metrics.

Model	B1	B4	R-L
GFSLT [70]	-	0.6	7.4
Sign2GPT [63]	-	0.9	11.4
Albanie [1]	12.78	1.0	10.2
Sincan [56]	18.8	1.3	8.9
Lost in Translation [29]	12.78	2.6	15.6
SignBind-LLM (Ours)	27.9	6.8	24.9

Table 3. Letter accuracy results on ChicagoFSWildPlus. SignBind-LLM outperforms all previous approaches, improving over the strongest prior model by 2.1 percentage points.

Model	Letter Accuracy
FS Recognition [53]	41.2%
Iterative Attention [54]	46.7%
Pannattee et al. [44]	48.0%
Gajurel et al. [16]	48.4%
TDC-SL [45]	50.0%
FS Attention [30]	57.8%
FSS-Net [55]	64.4%
CtoML [34]	54.9%
FS PoseNet [13]	71.1%
SignBind-LLM (Ours)	73.2%
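
The letter accuracy figures above can be made concrete with a small sketch. The source does not define the metric, so this assumes the definition commonly used for fingerspelling recognition, 1 minus the character edit distance divided by the reference length, expressed as a percentage. The function names `edit_distance` and `letter_accuracy` are illustrative, not the authors' evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference letter
                            curr[j - 1] + 1,      # insert a hypothesis letter
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def letter_accuracy(ref: str, hyp: str) -> float:
    """ASSUMED definition: 100 * (1 - edit_distance / len(ref))."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * (1.0 - edit_distance(ref, hyp) / len(ref))
```

Under this definition a hypothesis with many insertions can score below zero, which is one reason to treat the exact formula as an assumption until checked against the benchmark's own scoring script.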

Table 4. Comparison of Letter Accuracy and BLEU-4 score when different components are removed from the model. The components with the largest effect on BLEU-4 are the Fusion Transformer and the LLM: the Fusion Transformer has the largest effect on the pseudo-gloss outputs, and the LLM is needed to fill in the blanks and reorder these outputs into spoken English, which is important for the B-4 metric.

Model Variant	L-Acc. (%)	B-4
W/o Lipreading	54.3	13.6
W/o Fingerspelling	33.7	18.9
W/o Fusion	49.8	8.6
W/o Sequencer	51.2	14.3
W/o LLM	58.8	7.8
Full Model	73.2	22.1

Table 5. Ablation results showing phoneme accuracy (Pho.-Acc.) and BLEU-4 performance of individual input modalities on the How2Sign dataset.

Model	Pho.-Acc. (%)	B-4
Continuous Sign	-	6.3
Lipreading	65.6	-

Table 6. Comparison of Letter Accuracy and BLEU-4 scores for different LLMs (parameters in billions) tested with our model. The 1-billion-parameter model performs worst, with a drop of 12.3 points in Letter Accuracy and 10.0 points in BLEU-4 relative to the 7B model.

Model	Params (B)	L-Acc. (%)	B-4
Llama 3.2-1B	1.0	60.9	12.1
Llama 3.2-3B	3.0	64.8	16.4
Llama 2-7B	7.0	73.2	22.1

Table 7. Effect of temporal alignment between manual and non-manual streams. The learned alignment yields the highest Letter Accuracy and BLEU-4, while large positive shifts (e.g., +10) degrade performance the most.

Variant	L-Acc. (%)	B-4
No shift	68.3	17.2
$Δ + 5$	62.8	12.4
$Δ - 5$	70.6	18.1
$Δ + 10$	55.7	7.8
$Δ - 10$	64.9	15.6
Learned alignment	73.2	22.1

Table 8. Fusion Architecture Ablation. Comparison of different fusion mechanisms for combining manual and non-manual signals. Cross-attention improves upon naive concatenation and performs on par with gated fusion (22.3 vs. 22.1 BLEU-4), but it introduces substantial computational overhead ($O(T^{2})$ vs. $O(T)$) compared with our lightweight gated approach.

Fusion Method	BLEU-4
Concatenation + MLP	12.4
Cross-Attention Fusion	22.3
Gated Fusion	22.1
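
To illustrate the gated fusion row, here is a minimal single-frame sketch of the gate described by the paper's equations, $g = \mathrm{GumbelSoftmax}(\hat{c}_{logits}, \tau) = [g_{sign}, g_{fs}, g_{rest}]$ and $M = g_{sign} \odot E_{sign} + g_{fs} \odot E_{fs} + g_{rest} \odot n_{null}$. It uses plain Python lists instead of tensors, and the function names, the noise sampling details, and the example values are assumptions made for illustration rather than the authors' implementation.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample: softmax((logits + Gumbel(0, 1) noise) / tau)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)             # keep u > 0 so log(u) is finite
        gumbel = -math.log(-math.log(u))         # inverse-transform Gumbel sample
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)                            # subtract max for stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def gated_mix(gate, e_sign, e_fs, n_null):
    """M = g_sign * E_sign + g_fs * E_fs + g_rest * n_null for one frame."""
    g_sign, g_fs, g_rest = gate
    return [g_sign * s + g_fs * f + g_rest * n
            for s, f, n in zip(e_sign, e_fs, n_null)]


# Hypothetical 2-dimensional embeddings for a single frame.
gate = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=random.Random(0))
mixed = gated_mix(gate, e_sign=[1.0, 0.0], e_fs=[0.0, 1.0], n_null=[0.0, 0.0])
```

Because the gate is applied independently at each frame, the cost of this mixing step grows linearly with the number of frames, which is consistent with the $O(T)$ figure quoted in the caption.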

Table 9. Data Generalization and Pre-training Analysis. Training strategy comparison on How2Sign. YT-ASL [61] uses pretrained T5 (pt) or trains from scratch (np). Our staged pre-training approach performs better in both the zero-shot and H2S-only training setups.

Method	YT-ASL	H2S	B-4
Zero-Shot Transfer
YT-ASL (np) [61]	$✓$	$\times$	1.41
YT-ASL (pt) [61]	$✓$	$\times$	3.95
SignBind-LLM	$✓$	$\times$	8.3
H2S Only
Álvarez et al. [11]	$\times$	$✓$	2.21
GloFE-VN [37]	$\times$	$✓$	2.24
YT-ASL (np) [61]	$\times$	$✓$	0.86
YT-ASL (pt) [61]	$\times$	$✓$	1.22
Tarrés et al. [58]	$\times$	$✓$	8.03
SignBind-LLM	$\times$	$✓$	13.7
Joint Training
YT-ASL (np) [61]	$✓$	$✓$	5.60
YT-ASL (pt) [61]	$✓$	$✓$	11.89
Staged Pre-train $\to$ Fine-tune
YT-ASL (np) [61]	$✓$	$✓$	6.26
YT-ASL (pt) [61]	$✓$	$✓$	12.39
SignBind-LLM	$✓$	$✓$	22.1

Table 10. Pseudo-Gloss Generation Examples. GPT-4o effectively compresses English sentences into ASL-ordered pseudo-glosses, removing function words and reordering to SOV structure where appropriate. This compression reduces average sentence length by 30-40%, making CTC alignment more tractable.

Original English Sentence	Generated Pseudo-Gloss
So here we’ve got the startings of our bon fire.	HERE WE START OUR FIRE
We’re going to measure it and there you can see we have it measured.	GO MEASURE IT YOU SEE WE HAVE IT MEASURED WE
In my case I work more from home, and I work more from the college here that I cover, than I do actually at the office.	CASE I WORK MORE HOME I WORK MORE COLLEGE HERE I COVER I ACTUALLY OFFICE
I have a few different ones here.	HAVE DIFFERENT HERE
I have here four different travel cases for your rat.	HAVE HERE FOUR TRAVEL CASE YOUR RAT I

Table 11. Fingerspelling Label Examples. Words from ChicagoFSWild+ are split into character sequences for CTC loss computation. This character-level supervision enables the model to learn the 26-letter ASL manual alphabet.

Original Label	Character Sequence
Bills	B I L L S
political capital	P O L I T I C A L C A P I T A L
april	A P R I L
laurene simms	L A U R E N E S I M M S
modalities	M O D A L I T I E S

Table 12. Lipreading Label Examples. Pseudo-glosses are phonemized using CMUdict to create training targets for the lipreading expert. Phonemes are represented in ARPAbet notation (lowercase for consistency with our tokenization).

Pseudo-Gloss	Phoneme Sequence
HAVE MY OVER FLOW	hh ae v m ay ow v er f l ow
JUG COOLANT	jh ah g k uw l ah n t
HAVE FEW CLEAN HOW	hh ae v f y uw k l iy n hh aw
OUR TOOLS FEW	aw er t uw l z f y uw
HAVE DIFFERENT HERE	hh ae v d ih f er ah n t hh iy r
GO MEASURE IT	g ow m eh zh er ih t
YOU SEE WE HAVE	y uw s iy w iy hh ae v
IT MEASURED WE	ih t m eh zh er d w iy

Table 13. Full Model Architecture Comparison. Parameter counts (millions) and BLEU-4 scores on How2Sign. VE = Visual Encoder, LM = Language Model. $\approx$ indicates estimated values based on reported backbone architectures in the original papers. Methods without a BLEU-4 score either do not evaluate on How2Sign or report only other metrics. Our approach uses substantially more visual encoder parameters (197.5M) than prior work due to the three parallel expert streams, but achieves a corresponding 6.6-point improvement over the next-best method.

Method	VE Name	VE Params (M)	LM Name	LM Params (M)	Total (M)	B-4
MSLU [71]	EffNet	5.3	mT5-Base	582.4	587.7	2.4
SLRT [7]	EffNet	5.3	Transformer	$\approx$ 30	$\approx$ 35.3	-
GASLT [66]	I3D	13.0	Transformer	$\approx$ 30	$\approx$ 43.0	-
GFSLT-VLP [70]	ResNet18	11.7	mBart	680	691.7	-
Sign2GPT [63]	DinoV2	21.0	XGLM	1732.9	1753.9	-
SignLLM [21]	ResNet18	11.7	LLaMA-7B	6738.4	6750.1	-
C²RL [9]	ResNet18	11.7	mBart	680	691.7	9.4
FLa-LLM [10]	ResNet18	11.7	mBart	680	691.7	9.7
Uni-Sign [35]	EffNet+GCN	9.7	mT5-Base	582.4	592.1	14.9
Geo-Sign [14]	GCN+Geo+Attn	6.7	mT5-Base	582.4	589.1	15.1
Our Component Breakdown:
Continuous Sign Expert	DinoV2	101.1	-	-	101.1	-
Fingerspelling Expert	DinoV2 (shared)	0 (shared)	-	-	0	-
Lipreading Expert	ViT	62.5	-	-	62.5	-
Fusion Transformer	Attn	33.9	-	-	33.9	-
SignBind-LLM (Ours)	DinoV2+ViT	197.5	LLaMA-7B	6738.4	6935.9	22.1

Table 14. Expert Pre-training Hyperparameters. All three expert networks (Continuous Sign, Fingerspelling, Lipreading) share the same training configuration except for the number of epochs and vocabulary size.

Hyperparameter	Value
Optimizer	AdamW [31]
$β_{1}, β_{2}$	0.9, 0.98
$ϵ$	$10^{- 8}$
Learning Rate Schedule	Cosine Annealing
Initial LR	$10^{- 4}$
Minimum LR	$10^{- 6}$
Warmup Steps	1,000
Weight Decay	0.01 (exclude biases, layer norms)
Gradient Clipping	Max norm = 1.0
Dropout	0.1 (all projections & attention)
Batch Size	8 clips
Sequence Length	Variable (16–512 frames)
Epochs (Lipreading)	100
Epochs (Sign, Fingerspelling)	30
Loss Function	CTC
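
As a rough illustration of the learning-rate settings above, the sketch below combines 1,000 warmup steps with cosine annealing from $10^{-4}$ down to $10^{-6}$. The table does not state the warmup shape or the total number of optimizer steps, so the linear warmup and the `total_steps` argument are assumptions, and `learning_rate` is a hypothetical helper rather than the authors' scheduler.

```python
import math


def learning_rate(step, total_steps, base_lr=1e-4, min_lr=1e-6, warmup_steps=1000):
    """Learning rate at a 0-indexed optimizer step.

    ASSUMPTIONS: linear warmup (the table gives only the step count) and a
    single cosine cycle that ends at `total_steps`.
    """
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr if training runs longer
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical 10,000-step run: inspect a few points on the curve.
schedule = [learning_rate(s, total_steps=10_000) for s in (0, 999, 5_500, 10_000)]
```

The midpoint of the cosine phase lands halfway between the initial and minimum rates, which is a quick sanity check when wiring the same values into a framework scheduler.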

Table 15. LLM Fine-tuning Configuration. LoRA enables efficient adaptation of LLaMA-2-7B with minimal memory overhead. We apply rank-16 adapters to all attention and MLP projections.

Hyperparameter	Value
Base Model	LLaMA-2-7B-hf
Adaptation Method	LoRA [27]
LoRA Rank $r$	16
LoRA Scaling $α$	32 ( $α / r = 2.0$ )
LoRA Dropout	0.05
Target Modules	$W_{q}, W_{k}, W_{v}, W_{o}$ (attn); $W_{gate}, W_{up}, W_{down}$ (MLP)
Optimizer	AdamW
$β_{1}, β_{2}$	0.9, 0.999
Learning Rate	$2 \times 10^{- 4}$
Warmup Ratio	0.03
Gradient Clipping	Max norm = 1.0
Batch Size	1
Gradient Accumulation	4 steps (effective batch = 4)
Sequence Length	512 tokens
Epochs	10
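
A short back-of-the-envelope calculation shows what this configuration implies. Each rank-$r$ adapter on a $d_{in} \times d_{out}$ projection adds $r(d_{in} + d_{out})$ trainable parameters. The layer sizes used below (hidden size 4096, MLP size 11008, 32 layers) come from the public LLaMA-2-7B model configuration, not from the table, and the helper names are illustrative.

```python
# Values from Table 15.
RANK = 16
ALPHA = 32
BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 4

# ASSUMPTION: dimensions taken from the public LLaMA-2-7B configuration.
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32


def lora_params(d_in: int, d_out: int, r: int = RANK) -> int:
    """Trainable parameters of one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_trainable_params() -> int:
    """Total adapter parameters when all seven projections are targeted."""
    attn = 4 * lora_params(HIDDEN, HIDDEN)                # W_q, W_k, W_v, W_o
    mlp = (2 * lora_params(HIDDEN, INTERMEDIATE)          # W_gate, W_up
           + lora_params(INTERMEDIATE, HIDDEN))           # W_down
    return NUM_LAYERS * (attn + mlp)


scaling = ALPHA / RANK                          # factor applied to the low-rank update
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS
total = lora_trainable_params()
```

Under these assumptions the adapters add roughly 40M trainable parameters, well under 1% of the 6738.4M language-model parameters listed in Table 13.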

Table 16. Computational Requirements. Training time and memory usage for each stage on a single NVIDIA A100 (80GB). The visual encoder operates at real-time speeds, while LLM decoding remains the primary inference bottleneck.

Stage	Time	Peak Memory
Expert Pre-training	72 hours	68 GB
Fusion Training	72 hours	43 GB
LLM Fine-tuning	24 hours	78 GB
Total Training	240 hours	78 GB

Table 17. Qualitative examples of our multi-modal translation method on the How2Sign dataset. For each example, we show the outputs of the three expert predictors (Continuous Sign, Fingerspelling, Lip Reading), the fused pseudo-gloss sequence, and the final LLM-generated sentence.

Modality / Component	Output / Transcription
Continuous Sign	HAND WALL TREE FEEL WALL LOW YOU WHAT ALL LOOK
Fingerspelling	—
Lip Reading (Phonemes)	t uw ey n m ao l n s aa n y ao r t ow s ay d ao l y ae v t uw d uw ih z s eh n t er y r sh ow er z b ay l uh k ih ng ah p dh ah m n t ah m b ae k w eh r y uw k ey m f r ah
Fused	GAIN BALANCE ON TOE YOU HAVE CENTER YOU SHOULDERS LOOK UP MOUNTAIN YOU CAME FROM WHERE
Our Method (Combined)	So to gain that balance on your toe, all you have to do is center your shoulders by looking to the mountain. This is where you came from.
Reference	To gain balance on your toe side, all you have to do is center your shoulders by looking up the mountain, back where you came from.
Continuous Sign	POOR FLY HOUSE YOU ME DOOR
Fingerspelling	—
Lip Reading (Phonemes)	r ay t dh m d ao z t n dh eh m g ow t dh ah s t r b ay m aa z s t ae ao
Fused	RIGHT DOWN GO STORE BUY ME
Our Method (Combined)	So write them down and then go to the store and buy them for me.
Reference	Write them down, then go to the store.
Continuous Sign	WATCH MAKE
Fingerspelling	X P
Lip Reading (Phonemes)	s m n z w aa r t uw dh t ay t s ah m jh iy n s aa r n aa t t ay t ih n ah f
Fused	SOME TIGHT SOME NOT TIGHT
Our Method (Combined)	Some of them are too tight, some are not tight enough.
Reference	Some jeans are too tight, some jeans are not tight enough.
Continuous Sign	I WORK EYE BLUE
Fingerspelling	—
Lip Reading (Phonemes)	t d ey m iy r g oo ih g t eh w er k aa n z t r eh ih ng n d s r eh th p ih ng m ah l ow r b aa d iy
Fused	TODAY WE WORK LOWER HOUSE
Our Method (Combined)	Today we’re working on the lower part of the house, the basement.
Reference	Today we’re going to work on stretching and strengthening the lower body.
Continuous Sign	FRIEND BAG HAT YELLOW ONE WHERE
Fingerspelling	—
Lip Reading (Phonemes)	k dh ih s aa b v s l k ae ao l ow b iy y uw sh d ae z ih w ao r m ah p t uw l w eh n y eh r f er s t ih n ih k ah l iy y ah z ng ah b g
Fused	CAN USED AS WARM UP TOOL WHEN YOU FIRST INITIALLY USING BAG
Our Method (Combined)	This can be used as a warm up tool when you first are using the bag.
Reference	This obviously can also be used as a warm up tool when you’re first initially using a bag.
Continuous Sign	WHAT SHOUT
Fingerspelling	J I M D W D
Lip Reading (Phonemes)	z iy ah eh m jh k m d aw d w ih dh r o a r aw d r
Fused	I JIM DOWN WITH ROAR
Our Method (Combined)	So this is Jim Down with Roar zoo.
Reference	I’m Jim Dowd with Zoar Outdoor.

Table 18. Qualitative translation examples of our multi-modal translation method on BOBSL. Representative examples demonstrating the model’s performance on British Sign Language. The multi-stream architecture generalizes effectively from ASL to BSL despite grammatical and lexical differences.

Modality / Component	Output / Transcription
Continuous Sign	OAR
Fingerspelling	R E I T S L
Lip Reading (Phonemes)	ah dh r ah m t ay l z w ay l
Fused	OTHER REPTILES
Our Method (Combined)	Other reptiles include…
Reference	Other reptiles as well.
Continuous Sign	GOOD BIKE BEACH
Fingerspelling	—
Lip Reading (Phonemes)	b r ah r b ah l hh y uw m n ah n z d uw l f ah n s f eh
Fused	TROUBLE HUMANS DOLPHINS
Our Method (Combined)	Humans often cause trouble for dolphins when out at sea
Reference	The trouble is, humans are not built like dolphins.
Continuous Sign	MADE HELLO APPLE RED
Fingerspelling	—
Lip Reading (Phonemes)	ih t s n uh t l ow k m w iy k z eh r ah b f aw t ih t
Fused	NOT LIKE CARE IT
Our Method (Combined)	It’s not like we care about it.
Reference	It’s not a crazy theory.
Continuous Sign	WE FAR TREE
Fingerspelling	—
Lip Reading (Phonemes)	n ao t b er s t ih g r iy d iy ah n t s z er v ih b ah l
Fused	NOT SURVIVE
Our Method (Combined)	We’re not likely to survive this.
Reference	Not the best ingredients for survival.
Continuous Sign	CATS BIG HOUSE OUT ME 5 GOOD
Fingerspelling	—
Lip Reading (Phonemes)	m ah n f r ch n t l f ae t s b s t w z ey iy p ah b iy t ah p eh r t ah l v ao t ey k ih t aw t ih m k w ey zh ah n
Fused	AN FORTUNATELY CATS BEST KEEP LITTLE PARROT ALIVE TAKE OUT
Our Method (Combined)	Unfortunately the cats are best kept outside to keep the little parrot alive.
Reference	Unfortunately for the cats, the best way to keep this chubby little parrot alive is to take kitty out of the equation.
Continuous Sign	I ROOM TEST CAMERA
Fingerspelling	—
Lip Reading (Phonemes)	m ay s ey aw v er r p ah ah z s m er n t
Fused	ME SAY RAPID ASSESSMENT
Our Method (Combined)	I was saying this is our rapid assessment area.
Reference	As I was saying, this is our rapid assessment bay.

Equations

$\mathcal{F} : X \to G_{fused} \to S$

$G = \mathrm{LLM}_{gloss}(S) = \{g_{1}, \dots, g_{M}\}$

$P = \mathrm{Phonemize}(G) = \{p_{1}, \dots, p_{N}\}$

$X_{full}$

$X_{face}$

$\hat{c}_{logits} = W_{cls} h_{avg} + b_{cls} \in \mathbb{R}^{B \times 3}$

$g = \mathrm{GumbelSoftmax}(\hat{c}_{logits}, \tau) = [g_{sign}, g_{fs}, g_{rest}]$

$\tilde{H}_{manual}$

$\hat{L}_{sign}$

$\hat{L}_{fs} = \tilde{H}_{manual} W_{fs} + b_{fs} \in \mathbb{R}^{B \times T \times 26}$

$Z$

$G$

$\hat{L}_{lip}$

$E_{sign}$

$E_{fs}$

$E_{lip}$

$M = g_{sign} \odot E_{sign} + g_{fs} \odot E_{fs} + g_{rest} \odot n_{null}$

$\alpha$

$H_{fused}$

$Z = \mathrm{TransformerEncoder}(H_{fused}) \in \mathbb{R}^{B \times T \times d}$

$\hat{G} = \mathrm{CTC\text{-}Decode}(\mathrm{Softmax}(W_{out} Z + b_{out}))$

$\mathcal{L}_{m} = \mathrm{CTC}(\hat{L}_{m}, T_{m}), \quad m \in \{fs, sign, lip\}$

$\mathcal{L}_{fusion} = \mathrm{CTC}(\hat{G}_{t}, Y_{t})$

$\mathcal{L}_{m} = -\log \sum_{\pi \in A(T_{m})} \prod_{t=1}^{T} p_{m}(\pi_{t} \mid t)$

$\mathcal{L}_{fusion} = -\log \sum_{\pi \in A(G)} \prod_{t=1}^{T} \mathrm{softmax}(W_{out} Z + b_{out})_{t, \pi_{t}}$

$\text{input} = \underbrace{\texttt{<S2S> <GLOSS>}\; g\; \texttt{</GLOSS> <TXT>}}_{\text{prompt (masked from loss)}} \; \underbrace{s}_{\text{target sentence}}$

$\mathcal{L}_{LLM} = -\sum_{t = t_{start}}^{t_{end}} \log p_{\theta}(s_{t} \mid g, s_{< t})$
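
The CTC objective used for the experts and the fusion stage, $-\log \sum_{\pi \in A(T)} \prod_{t} p(\pi_{t} \mid t)$, sums the probability of every frame-level path that collapses to the target. The sketch below is a small pure-Python illustration, not the authors' training code: it evaluates the sum by brute force and with the standard forward recursion, so the two can be compared on toy inputs. Using index 0 as the blank symbol is an assumption.

```python
import itertools
import math

BLANK = 0  # ASSUMPTION: index 0 is the CTC blank symbol


def collapse(path):
    """CTC collapse: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out


def ctc_loss_bruteforce(probs, target):
    """Enumerate every path; only practical for tiny T and vocabulary."""
    total = 0.0
    for path in itertools.product(range(len(probs[0])), repeat=len(probs)):
        if collapse(path) == list(target):
            p = 1.0
            for t, sym in enumerate(path):
                p *= probs[t][sym]
            total += p
    return -math.log(total)  # raises ValueError if no path is feasible


def ctc_loss_forward(probs, target):
    """Forward recursion over the blank-interleaved target."""
    ext = [BLANK]
    for sym in target:
        ext += [sym, BLANK]
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for s in range(len(ext)):
            a = alpha[s]
            if s >= 1:
                a += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new
    total = alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)
    return -math.log(total)  # raises ValueError if no path is feasible


# Toy per-frame distributions over {blank, 1, 2} for T = 3 frames.
toy_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
loss = ctc_loss_forward(toy_probs, [1, 2])
```

The same `collapse` rule also gives a greedy decoder when applied to the per-frame argmax. The source only writes CTC-Decode without naming a decoding strategy, so greedy decoding would be one possible reading.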


Taxonomy

Topics: Hand Gesture Recognition Systems · Natural Language Processing Techniques · Hearing Impairment and Communication

Full text

SignBind-LLM: Multi-Stage Modality Fusion for Sign Language Translation

Marshall Thomas, Edward Fish, Richard Bowden

CVSSP, University of Surrey

Guildford, Surrey, United Kingdom

{marshall.thomas, edward.fish, r.bowden}@surrey.ac.uk

Abstract

Despite progress in gloss-free Sign Language Translation (SLT), traditional single-modality end-to-end approaches consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in SLT with Large Language Models has sidestepped this challenge, forcing a single network to learn these simultaneously, resulting in poor performance when tasked with translating crucial information such as names, places, and technical terms. We introduce SignBind-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state-of-the-art on the How2Sign, ChicagoFSWildPlus, and BOBSL datasets with a BLEU-4 score of 22.1, a letter accuracy of 73.2%, and a BLEU-4 score of 6.8, respectively. These results validate our core hypothesis: isolating and solving distinct recognition tasks before fusion provides a more powerful and effective pathway to robust, high-fidelity sign language translation.

1 Introduction

Sign languages are fully-fledged natural languages used by millions of Deaf people worldwide. Sign languages rely on manual gestures (handshapes, movements, and locations) and non-manual signals (facial expressions, mouthings, and mouth gestures). In sign language linguistics, mouthing denotes the silent articulation of spoken words that accompanies signs, while mouth gestures are non-verbal movements integral to the sign itself, conveying grammatical or affective nuance. Languages like American Sign Language (ASL), British Sign Language (BSL), and German Sign Language (DGS) [25, 5, 19] each possess their own lexicon, grammar, fingerspelling method, and mouthings.

Early methods focused on Isolated Sign Language Recognition (ISLR) and typically required expertly annotated glosses and detailed English translations for individual signs [6]. While the advent of large, gloss-free datasets (e.g., YouTube-ASL [61], Public DGS [26], How2Sign [11]) and the integration of large language models (LLMs) [37] have moved the field toward end-to-end Sign Language Translation (SLT), two pervasive challenges continue to limit real-world performance:

Fingerspelling: Fingerspelling is a critical component of all sign languages and is often used for proper names, technical terms, and loanwords. This constitutes a significant portion of signing (e.g., 12–35% of ASL [62]). However, current gloss-free SLT systems [36, 39] treat it as a subordinate task within the main translation pipeline, while other SOTA methods ignore it entirely [35, 14]. As a result, fingerspelled terms are frequently mistranslated.

Non-Manual Cues: Mouthings and facial expressions carry crucial disambiguating information, akin to phonemes or visemes in speech [12, 43]. Yet, existing SLT models rarely utilize these powerful signals. The key challenge in integrating these features in end-to-end approaches is that they are not aligned with manual signs, particularly during fingerspelling.

In contrast to advances in Automatic Speech Recognition (ASR) [38] and Visual Speech Recognition (VSR) [59], which leverage massive pretraining and powerful architectures, sign language systems remain constrained. This discrepancy is primarily due to (a) data scarcity; (b) the spatio-temporal complexity of co-articulated gestures; and critically, (c) the unaddressed temporal offset between these manual and non-manual signals.

To overcome these fundamental limitations, we introduce SignBind-LLM, a novel gloss-free SLT framework built on a paradigm of multi-modal, parallel-stream processing. The novelty lies in its dedicated, three-pronged architecture designed to explicitly solve the issue of non-manual misalignment, temporal variance, fingerspelling recognition, and the integration of lipreading. We make the following contributions:

Dedicated Modality Streams: We implement three specialized processing streams: one for continuous sign, one dedicated sub-network for fine-grained fingerspelling detection, and one dedicated visual speech recognition (lipreading) component. All three are pre-trained independently for their respective tasks.

Asynchronicity-Aware Fusion Encoder: We introduce a lightweight transformer encoder equipped with a learnable gating mechanism that dynamically weighs the contribution of each visual stream before temporal integration. The Fusion Encoder adaptively balances information from the active modality (i.e., signing or fingerspelling) and the lipreading branch, enabling the model to emphasize the most informative cues under varying temporal asynchrony.

Contextual LLM Decoding: We train a specialised LLM to decode fused representations in sign gloss order into English sentences, generating contextually accurate and fluent translations without relying on any direct gloss supervision. We further provide LLM-generated pseudo-gloss annotations and phonemes for the 1.5M sentences in BOBSL [1] and 35k in How2Sign [11].

We demonstrate the efficacy of this new paradigm on How2Sign, ChicagoFSWildPlus, and BOBSL, with large improvements over SOTA end-to-end approaches ( $+6.6$ B4 on How2Sign and $+4.2$ B4 on BOBSL). Our findings demonstrate that combining modality-specific, parallel predictors with an LLM decoder is the scalable path forward for high-quality translation of real-world signed content.

2 Related Work

2.1 Sign Language Understanding

Sign Language Understanding (SLU) has evolved through three increasingly ambitious tasks:

Isolated Sign Language Recognition (ISLR): Early work focused on recognizing individual signs in controlled settings. Methods such as 3D CNNs over spatiotemporal volumes [41] and hybrid CNN–HMM pipelines [32] achieved high accuracy on small vocabularies (e.g., 100–500 signs), but relied on carefully segmented clips and extensive manual gloss annotations.

Continuous Sign Language Recognition (CSLR): To scale beyond isolated signs, CSLR methods adopted sequence-level training with Connectionist Temporal Classification (CTC) loss [23]. Recurrent architectures (CNN–BiLSTM–CTC) [22] and, more recently, Transformer encoders [41] enabled end-to-end gloss prediction from unsegmented video. While these models tolerate variable sign durations, they still depend on gloss-level transcripts, which are costly to annotate. Furthermore, CSLR methods often struggle with out-of-vocabulary signs and fingerspelling.

Sign Language Translation (SLT): SLT goes one step further by generating natural language sentences directly from sign video. SLT methods can largely be categorised into two groups. Gloss-based SLT pipelines [68, 7, 67] utilise sign-aligned annotated data to predict gloss sequences as in the CSLR case above. These can then be used as input via a sequence-to-sequence model [6] to improve translation quality. Examples include CNN–LSTM–attention frameworks that achieve fluent translation but inherit errors from the gloss recognizer. Alternatively, gloss-free SLT [10, 71, 39, 21, 3] bypasses intermediate glosses, learning a direct video-to-text mapping. Gloss-free methods have the advantage that they can leverage large datasets without accurate and aligned gloss-level translations. This enables large-scale video–text pretraining [70, 37, 71, 58] and multimodal contrastive objectives [24] to improve translation fluency. More recent approaches have focussed on leveraging LLMs’ language modelling capabilities to incorporate sign features either from RGB [61], Pose [14], or a fusion of both modalities [35, 63]. However, these models often underperform on fingerspelling and non-manual cues, since they attempt to learn all visual subtasks in a single end-to-end network.

2.2 Lipreading and Visual Speech Recognition

Lipreading, or visual speech recognition (VSR), can complement sign language understanding by capturing silent speech cues. Modern systems employ deep learning-based architectures. LipNet [50] introduced an end-to-end model using spatiotemporal CNNs [33, 8] and BiLSTMs [22], achieving state-of-the-art sentence-level lipreading performance. Transformer-based models, such as Lipformer [64], further improved contextual understanding across video frames. Recent advances leverage self-supervised learning and multimodal fusion to enhance lipreading robustness. AV-HuBERT [52] demonstrated the efficacy of large-scale audiovisual pretraining, achieving strong performance under challenging conditions [59]. However, lipreading remains underexplored in the context of fingerspelling and full‐sentence SLT, despite mouthing cues carrying up to 40% of linguistic information in natural sign production (e.g., lexical disambiguation, grammatical markers) [2].

2.3 Multimodal Approaches

Multimodal learning has emerged as a promising approach to improve SLT by integrating visual, linguistic, and contextual information [29]. Previous systems fused RGB video, pose sequences, and optical flow [14] to enhance recognition accuracy. Recent works such as GFSLT-VLP [70] incorporated video-text contrastive learning and cross-modal attention mechanisms to bridge the gap between sign language and natural language representations.

However, current fingerspelling detection methods remain focused on hand-centric cues and fail to leverage the rich information present in lip movements. Approaches such as synthetic data augmentation [15], large “in-the-wild” collections [69, 20], and pose-based models achieve moderate accuracy on isolated letter streams but do not integrate mouthings. This design overlooks the well-documented co-articulation between handshapes and lip movements in natural signing [2], leading to persistent errors on homonymous handshapes (e.g., “M” vs. “N”) and proper-noun detection. To date, no SLT system has explicitly fused high-resolution fingerspelling and lipreading streams while accounting for their temporal misalignment.

2.4 LLMs in SLT and VSR

The rise of Large Language Models (LLMs) has revolutionized sign language understanding, fingerspelling detection, and lipreading by leveraging vast linguistic knowledge and multimodal learning capabilities. In SLT, LLMs have been employed to bridge the gap between sign video sequences and natural language text [14, 35]. Newer approaches [48] applied transformer-based architectures to generate coherent text from sign videos, outperforming traditional gloss-based approaches. For fingerspelling detection, LLMs have facilitated cross-modal learning by aligning hand gestures with corresponding textual representations [57, 51]. These approaches demonstrated that LLMs can effectively enhance fingerspelling recognition by integrating linguistic and visual modalities but they have not previously been combined with lipreading.

In lipreading, LLMs have been integrated into audiovisual speech recognition pipelines to improve contextual understanding and transcription accuracy [59]. The first approach to utilize LLM-based encoders [65] enhanced lipreading robustness, even under challenging conditions.

Despite progress in each of these individual modality streams and the recent availability of large sign language datasets and language models, no one has yet proposed a method that combines lipreading, fingerspelling, and sign recognition in a unified framework for SLT.

3 Method

We propose SignBind-LLM, a novel sign language translation framework that decomposes the complex translation task into specialized sub-tasks before fusion. Our key insight is that sign languages employ distinct communication channels—continuous signing, fingerspelling, and mouthings—that benefit from dedicated modeling before integration. We achieve this through a four-stage pipeline: (1) Target Generation via Text Pre-processing, (2) Modality-Specific Pre-training, (3) Multi-modal Fusion, and (4) Language Model Refinement.

Figure 2 illustrates our complete architecture.

3.1 Problem Formulation

Given an input video sequence of Sign Language $\mathbf{X}=\{x_{1},\ldots,x_{T}\}$ , where $x_{t}\in\mathbb{R}^{H\times W\times 3}$ represents frame $t$ , our primary objective is to produce an English translation $\mathbf{S}=\{s_{1},\ldots,s_{K}\}$ . We formally decompose this mapping, $\mathcal{F}$ , into a series of specialized functions:

$$\mathcal{F}:\mathbf{X}\to\mathbf{G}_{\text{fused}}\to\mathbf{S}$$

Here, $\mathcal{F}_{\text{sign}}$ , $\mathcal{F}_{\text{fs}}$ , and $\mathcal{F}_{\text{lip}}$ denote the expert functions for continuous signing, fingerspelling, and lipreading. Their outputs are integrated into a unified pseudo-gloss representation $\mathbf{G}_{\text{fused}}$ , which is finally translated into the target sentence $\mathbf{S}$ .

3.2 Stage 1: Target Generation via Text Pre-processing

A significant barrier to multi-stream modeling is the lack of parallel, granular annotations. We overcome this by automatically generating three complementary training targets directly from English subtitles $\mathbf{S}=\{w_{1},\ldots,w_{L}\}$ .

Pseudo-Gloss Generation: To create a compact and linguistically relevant target for the sign recognition task, we employ GPT-4o [47] to transform English sentences into ASL-ordered pseudo-glosses. This process removes function words and reorders content to match common signing patterns.

[TABLE]

Here, $M\leq L$ reflects the removal of semantically redundant elements such as articles; the remaining tokens are re-ordered into a subject-object-verb structure.

Phoneme Extraction: Mouthings provide critical disambiguation for visually similar signs. To supervise our lipreading module, we extract phoneme sequences from the generated pseudo-glosses:

[TABLE]

Each $p_{i}$ represents an English phoneme, enabling our model to learn the fine-grained articulatory motions of mouthed words.
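A minimal sketch of how pseudo-glosses might be mapped to a phoneme target sequence. The tiny `LEXICON` (ARPAbet-style symbols without stress marks) and the `<unk>` fallback are hypothetical stand-ins; this section does not name the pronunciation resource actually used.

```python
from typing import Dict, List

# Hypothetical toy pronunciation lexicon; a real system would use a full
# pronunciation dictionary or a grapheme-to-phoneme model.
LEXICON: Dict[str, List[str]] = {
    "book": ["B", "UH", "K"],
    "read": ["R", "IY", "D"],
}

def glosses_to_phonemes(glosses: List[str],
                        lexicon: Dict[str, List[str]] = LEXICON,
                        unk: str = "<unk>") -> List[str]:
    """Concatenate the phonemes of each gloss into one target sequence."""
    phonemes: List[str] = []
    for gloss in glosses:
        # Glosses are conventionally upper-case, so normalise before lookup.
        phonemes.extend(lexicon.get(gloss.lower(), [unk]))
    return phonemes
```

For example, `glosses_to_phonemes(["BOOK", "READ"])` yields the six-phoneme sequence for the two words, while an out-of-lexicon gloss contributes a single `<unk>` token.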

Fingerspelling Identification: For the fingerspelling labels we use the ChicagoFSWild+ dataset and split each word into an individual character sequence $\mathbf{F}=\{f_{1},\ldots,f_{n}\}$ , where each $f_{i}$ is a character in the word label, enabling the model to learn each letter of the alphabet. We also isolate sequences that potentially contain fingerspelling from the pseudo-glosses by applying rule-based detection to $\mathbf{G}$ : we identify proper nouns and technical terms, which are typically fingerspelled.
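The two operations above can be sketched as follows. The character split follows the text directly; the detection rule (a capitalised, non-sentence-initial token, or membership in a small `TECH_TERMS` set) is an assumed heuristic for illustration only, since the actual rules are not given in this section.

```python
from typing import List

# Hypothetical list of technical terms; not taken from the paper.
TECH_TERMS = {"wifi", "usb", "html"}

def to_char_targets(word: str) -> List[str]:
    """Split a fingerspelled word label into per-letter targets f_1..f_n."""
    return [ch for ch in word.lower() if ch.isalpha()]

def fingerspelling_candidates(tokens: List[str]) -> List[str]:
    """Flag tokens that are likely to be fingerspelled (assumed heuristic)."""
    candidates = []
    for i, tok in enumerate(tokens):
        # Capitalised and not sentence-initial: treat as a proper noun.
        is_proper = i > 0 and tok[:1].isupper()
        is_technical = tok.lower() in TECH_TERMS
        if is_proper or is_technical:
            candidates.append(tok)
    return candidates
```

A part-of-speech tagger would be a more reliable way to find proper nouns than capitalisation; the heuristic here only keeps the example dependency-free.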

3.3 Stage 2: Modality-Specific Encoders

This stage defines the “expert” networks that learn to recognize each specific communication channel.

Video Pre-processing: We process the input video into two distinct streams to feed the appropriate experts. We use MediaPipe [40] to extract both the full frames $\mathbf{X}_{\text{full}}$ (for manual signs) and tightly-cropped face frames $\mathbf{X}_{\text{face}}$ (for lipreading).

[TABLE]

Both streams are normalized with a resolution of $224\times 224$ , where $B$ is the batch size.

Shared Backbone and Manual Experts: The core of our visual encoder is a single DINOv2 backbone [42] that processes the full-frame stream from $\mathbf{X}_{\text{full}}$ . The features extracted from this backbone, $\mathbf{H}_{\text{manual}}$ , are shared, serving as the input for three independent parallel heads.

Dynamic Sequence Routing: A key innovation of our work is a dynamic routing module that operates on the shared backbone features $\mathbf{H}_{\text{manual}}$ . We find that applying all experts to all frames is inefficient and introduces noise, so we train a lightweight classifier to route each segment to the correct expert: $\{\text{Sign},\text{Fingerspelling},\text{Rest}\}$ .

First, we apply temporal-average pooling to the DINOv2 features $\mathbf{H}_{\text{manual}}$ to obtain a sequence-level representation. A small MLP head then processes this feature to produce class logits.

[TABLE]

To maintain differentiability for training, we use the Gumbel-Softmax [28] reparameterization trick on the raw logits to obtain a one-hot-like selection vector $\mathbf{g}$ :

[TABLE]

where $\tau$ is the temperature parameter. This routing vector is passed to the fusion stage (Stage 3) to dynamically weight the outputs of the expert heads.
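For readers unfamiliar with the trick, here is a dependency-free sketch of Gumbel-Softmax sampling over the three routing classes. It illustrates the standard technique rather than the authors' implementation; the function names, the optional `rng` argument and the example logits are assumptions.

```python
import math
import random
from typing import List, Optional

def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_softmax(logits: List[float], tau: float = 1.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    rng = rng or random.Random()
    perturbed = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) sample
        perturbed.append((logit + gumbel) / tau)
    return softmax(perturbed)

# Routing weights over {Sign, Fingerspelling, Rest} for hypothetical logits.
g = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=random.Random(0))
```

Lower values of `tau` push the output towards a one-hot vector, while higher values give a smoother distribution across the experts.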

Continuous Sign Recognition: In parallel with the router, the continuous signing expert processes the full, unpooled DINOv2 features. We add positional embeddings ( $\mathbf{PE}$ ) and pass the features through a dedicated linear head to project to the sign vocabulary size $|\mathcal{V}_{\text{sign}}|$ :

[TABLE]

Fingerspelling Detection: Similarly, the fingerspelling expert processes the same position-encoded features, $\tilde{\mathbf{H}}_{\text{manual}}$ , but uses its own dedicated linear head to produce logits $\mathbf{\hat{L}}_{\text{fs}}$ over the 26 characters of the alphabet:

[TABLE]

Lipreading Module: The lipreading expert processes the face-cropped stream $\mathbf{X}_{\text{face}}$ . We employ a masked ViT ( $\text{ViT}_{\text{lip}}$ ) with a 50% masking ratio to force the model to learn robust representations. A 1D convolution then performs temporal adaptation, and a final linear layer projects to the phoneme vocabulary $|\mathcal{P}|$ :

[TABLE]

Note that $T^{\prime}<T$ because the convolution downsamples along the temporal axis.
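The reduced length can be computed with the standard output-length formula for a 1D convolution. The kernel size, stride and padding below are hypothetical values chosen for illustration; the text above does not state them.

```python
def conv1d_out_len(t: int, kernel: int, stride: int = 1,
                   padding: int = 0, dilation: int = 1) -> int:
    """Output length of a 1D convolution over a sequence of length t."""
    return (t + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

# Hypothetical settings: a stride-2 convolution halves a 100-frame clip.
t_prime = conv1d_out_len(100, kernel=3, stride=2, padding=1)
```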

3.4 Stage 3: Temporal-Aware Multi-modal Fusion

This stage addresses the core challenge of fusing our asynchronous experts. First, we project all outputs to a common dimension and align their temporal lengths. The lipreading logits $\mathbf{L}_{\text{lip}}$ are upsampled from $T^{\prime}$ to $T$ :

[TABLE]
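A minimal sketch of the temporal alignment step, assuming nearest-neighbour upsampling. The interpolation scheme is an assumption; linear interpolation would be an equally plausible choice.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def upsample_nearest(seq: Sequence[T], target_len: int) -> List[T]:
    """Stretch a sequence of length T' to target_len by repeating frames."""
    src_len = len(seq)
    if src_len == 0:
        raise ValueError("cannot upsample an empty sequence")
    # Map each output index back to its source index, clamped to the end.
    return [seq[min(src_len - 1, (t * src_len) // target_len)]
            for t in range(target_len)]
```

Each element of `seq` may be a per-frame logit vector; the function only repeats frames and does not modify their contents.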

Gated Manual Feature Aggregation: We use the routing vector $\mathbf{g}$ from Stage 2 to create a single, unified manual representation $\mathbf{M}$ . This dynamically selects the correct expert output (sign or fingerspelling) for each segment.

[TABLE]

Here, $\odot$ denotes element-wise multiplication, and $\mathbf{n}_{\text{null}}$ is a learned null vector to represent rest periods.
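One way to read the aggregation is as a weighted sum of the two expert outputs and the null vector, with the routing vector supplying the weights. The sketch below assumes that form, with $\mathbf{g}=(g_{\text{sign}},g_{\text{fs}},g_{\text{rest}})$ ; the function and argument names are illustrative.

```python
from typing import List, Sequence

def gated_manual(g: Sequence[float],
                 e_sign: List[List[float]],
                 e_fs: List[List[float]],
                 n_null: List[float]) -> List[List[float]]:
    """Per-frame mix: g_sign * E_sign + g_fs * E_fs + g_rest * n_null."""
    g_sign, g_fs, g_rest = g
    fused = []
    for sign_t, fs_t in zip(e_sign, e_fs):
        fused.append([g_sign * s + g_fs * f + g_rest * n
                      for s, f, n in zip(sign_t, fs_t, n_null)])
    return fused
```

With a near one-hot `g`, the output reduces to the selected expert's features, or to the null vector for a rest segment.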

Adaptive Feature Gating: The relative importance of manual signs versus mouthings is context-dependent. We learn this balance with an adaptive gating mechanism. A learned gate $\boldsymbol{\alpha}$ controls the information flow from the manual stream $\mathbf{M}$ and the lip stream $\mathbf{E}_{\text{lip}}$ :

[TABLE]

where $[\cdot;\cdot]$ denotes feature concatenation.
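A sketch of one common gating form, assuming the gate is a sigmoid over a linear map of the concatenated features and that the output is a convex combination of the two streams. The convex-combination form, the weight shapes and the names are assumptions rather than the paper's confirmed equation.

```python
import math
from typing import List

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def adaptive_gate(m_t: List[float], lip_t: List[float],
                  weight: List[List[float]], bias: List[float]) -> List[float]:
    """alpha = sigmoid(W [m; lip] + b); output = alpha*m + (1 - alpha)*lip."""
    concat = m_t + lip_t  # feature concatenation [m; lip], length 2d
    alpha = [sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
             for row, b in zip(weight, bias)]  # one gate value per dimension
    return [a * m + (1.0 - a) * l for a, m, l in zip(alpha, m_t, lip_t)]
```

With zero weights and bias the gate sits at 0.5 and the two streams are averaged; a large positive bias lets the manual stream dominate.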

Transformer-based Temporal Modelling: The resulting fused features $\mathbf{H}_{\text{fused}}$ are passed through a final TransformerEncoder to model long-range temporal dependencies, producing the final contextual representation $\mathbf{Z}$ :

[TABLE]

This fusion module is trained to produce the correct pseudo-gloss sequence $\hat{\mathbf{G}}$ via a final CTC loss:

[TABLE]
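At inference, a CTC-trained head is typically decoded by collapsing repeated predictions and dropping blanks. The sketch below shows greedy CTC decoding; the decoding strategy, the blank index and the toy vocabulary are assumptions, since the text only specifies the training loss.

```python
from typing import List, Sequence

def ctc_greedy_decode(logits: Sequence[Sequence[float]],
                      vocab: Sequence[str], blank: int = 0) -> List[str]:
    """Take the argmax per frame, merge repeats, then remove blanks."""
    best = [max(range(len(row)), key=row.__getitem__) for row in logits]
    decoded = []
    prev = None
    for idx in best:
        # Emit a token only when it differs from the previous frame's label.
        if idx != prev and idx != blank:
            decoded.append(vocab[idx])
        prev = idx
    return decoded

vocab = ["<blank>", "HELLO", "WORLD"]  # hypothetical pseudo-gloss vocabulary
frames = [
    [0.1, 0.8, 0.1],  # HELLO
    [0.2, 0.7, 0.1],  # HELLO (repeat, merged)
    [0.9, 0.0, 0.1],  # blank separates two emissions
    [0.1, 0.8, 0.1],  # HELLO again
    [0.1, 0.2, 0.7],  # WORLD
    [0.1, 0.3, 0.6],  # WORLD (repeat, merged)
]
glosses = ctc_greedy_decode(frames, vocab)
```

The blank between the second and fourth frames is what allows the same gloss to be emitted twice in a row.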

3.5 Stage 4: Language Model Refinement

The fused output $\hat{\mathbf{G}}$ is a sequence of pseudo-glosses, which is not representative of spoken English. The final stage employs a language model, $\text{LLM}_{\theta}$ , to translate this intermediate representation into the final sentence $\hat{\mathbf{S}}=\text{LLM}_{\theta}(\hat{\mathbf{G}})$ . This model is trained independently of the visual pipeline. We pre-train $\text{LLM}_{\theta}$ on a large corpus of (pseudo-gloss, English sentence) pairs $(\mathbf{G},\mathbf{S})$ generated in Stage 1. This decoupled approach allows the LLM to learn a robust linguistic mapping that generalizes well to the noisy gloss outputs $\hat{\mathbf{G}}$ produced by the visual model at inference time.

3.6 Training Strategy

Our curriculum is strictly staged to ensure stable convergence and component specialization. We do not use a single, joint end-to-end loss. Instead, each module is trained independently, with its parameters frozen before being used by the next stage. This process is applied in two phases: pre-training and fine-tuning.

Phase 1: Stage-wise Pre-training: We first pre-train each component on large-scale, diverse datasets (e.g., YouTube-ASL [61], ChicagoFSWild+).

1. Train Experts: The Sequence Classifier ( $\mathcal{L}_{\text{cls}}$ ), Sign Encoder ( $\mathcal{L}_{\text{sign}}^{\text{CTC}}$ ), FS Encoder ( $\mathcal{L}_{\text{fs}}^{\text{CTC}}$ ), and Lipreading module ( $\mathcal{L}_{\text{lip}}^{\text{CTC}}$ ) are all trained separately on their respective targets from Stage 1. The losses for this stage are defined as:

[TABLE]

where $L_{m}$ are the output logits from branch $m$ and $T_{m}$ the corresponding ground truth token sequence.

2. Freeze Experts & Train Fusion: After the experts are trained, their weights are frozen. The Fusion Module is then trained on the frozen outputs of the experts to optimize $\mathcal{L}_{\text{fusion}}$ . The loss for the fusion is defined as:

[TABLE]

3. Train LLM: Separately, the Language Model is pre-trained on all available $(\mathbf{G},\mathbf{S})$ text pairs to optimize $\mathcal{L}_{\text{LLM}}$ using standard Cross Entropy Loss.

Phase 2: Stage-wise Fine-tuning: We repeat the same staged process on the smaller, high-quality fine-tuning dataset (e.g., How2Sign [11]).

1. Fine-tune Experts: The pre-trained experts are fine-tuned on the How2Sign data, again using their independent losses.

2. Freeze Experts & Fine-tune Fusion: The fine-tuned expert weights are frozen. The pre-trained Fusion Module is then fine-tuned on their outputs.

3. LLM: The LLM is already fully trained and remains frozen during this phase.

This decoupled, staged methodology ensures that each component becomes a robust expert at its specific task before its outputs are used to train subsequent modules.
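The ordering of the schedule can be sketched with placeholder modules. The `Module` class and `train` function below are hypothetical stand-ins that only record which dataset each component saw and whether it is frozen; no actual optimisation is performed.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Module:
    """Stand-in for a network; only tracks what the schedule does to it."""
    name: str
    frozen: bool = False
    history: List[str] = field(default_factory=list)

def train(module: Module, dataset: str) -> None:
    if module.frozen:
        raise RuntimeError(f"{module.name} is frozen")
    module.history.append(dataset)  # placeholder for an optimisation loop

def run_phase(experts: List[Module], fusion: Module, dataset: str) -> None:
    # Step 1: each expert is trained independently with its own loss.
    for expert in experts:
        expert.frozen = False
        train(expert, dataset)
    # Step 2: freeze the experts, then train fusion on their outputs.
    for expert in experts:
        expert.frozen = True
    fusion.frozen = False
    train(fusion, dataset)
    fusion.frozen = True

experts = [Module("router"), Module("sign"), Module("fingerspelling"), Module("lip")]
fusion = Module("fusion")
llm = Module("llm")

run_phase(experts, fusion, "pretrain")  # Phase 1
train(llm, "gloss-text pairs")          # LLM is trained separately, on text only
llm.frozen = True
run_phase(experts, fusion, "finetune")  # Phase 2; the LLM stays frozen
```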

4 Experiments

4.1 Datasets

For ASL we pre-trained our model using the YouTube-ASL dataset and ChicagoFSWild+, then fine-tuned it on How2Sign. We then evaluated our proposed framework on How2Sign and ChicagoFSWild+. For BSL we trained and evaluated our model on BOBSL [1].

YouTube-ASL [61]: A large-scale ASL dataset comprising 60K videos and 1,000 hours sourced from YouTube, annotated with sentence-level transcriptions. This dataset provides diverse signing conditions, including variations in background, lighting, and signer appearance. The dataset features over 2,500 unique signers, ensuring that the model generalizes well across different users.

ChicagoFSWildPlus [4]: A dataset of 55,232 in-the-wild ASL fingerspelling sequences performed by 260 signers. It covers varied environments and occlusions, which makes it well suited to evaluating the robustness of fingerspelling detection.

How2Sign [11]: How2Sign is a large‐scale, continuous ASL corpus derived from 2,456 instructional “How2” videos, totaling over 80 hours of footage. Recordings were made in two settings: a Green-Screen studio (79.1 h over 2,529 videos) and the Panoptic studio (2.96 h over 124 videos). Eleven signers (5 hearing ASL interpreters, 2 hard-of-hearing, 4 Deaf) produced 35,191 sentence‐level clips (average 162 frames/5.4 s, 17 words), yielding a vocabulary of over 16,000 English words.

BOBSL [1]: A large-scale video dataset of British Sign Language drawn from BSL-interpreted BBC broadcast footage. The data features 39 unique interpreters from 1,962 episodes across 426 TV shows, resulting in approximately 1,467 hours of video content. The videos are annotated with English subtitles, totalling approximately 1.2M sentences.

4.2 Evaluation Metrics

We evaluated model performance using several common metrics: Letter Accuracy [18], BLEU Score [46] and ROUGE [17]. For all metrics higher scores demonstrate improved performance.
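For concreteness, letter accuracy in the fingerspelling literature is commonly computed as one minus the character-level edit distance divided by the reference length. The sketch below assumes that definition; it is an illustration, not the authors' evaluation script.

```python
from typing import Sequence

def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def letter_accuracy(ref: str, hyp: str) -> float:
    """1 - edit_distance / len(ref); can be negative for very long outputs."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)
```

Under this definition, a hypothesis that drops one letter from a seven-letter word scores 6/7.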

5 Results

In this section we compare the performance of SignBind-LLM with several state-of-the-art models from recent literature. Table 1 and Table 2 show the comparison between our approach and other approaches in SLT on How2Sign and BOBSL respectively. Table 3 shows the comparison for fingerspelling on the ChicagoFSWildPlus dataset.

5.1 Qualitative Results

Figure 2 shows three example translations, with the outputs from each stream. It also shows where the translation process fails and how the fusion transformer helps to remedy this. Finally, the figure shows how the LLM generates the spoken-English sentence from the fusion predictions. We observe that the mouthed features provide a strong signal for the fusion encoder, while the LLM is effective at correcting grammatical errors in the translation.

5.2 Ablation Study

In this section, we discuss the different ablation studies performed to demonstrate the contribution of the various components in the architecture.

Model Variants: The first ablation was to understand the contribution of each component and the importance of each modality. We conducted an ablation study by selectively disabling key features of SignBind-LLM and comparing the results. Table 4 presents the results. We identify that the fusion network and LLM are key for improving performance, while lipreading is the most important modality.

Stand-alone Effectiveness: This ablation focuses on the effectiveness of the two primary experts, the continuous sign predictor and the lipreading predictor. Table 5 shows the phoneme prediction performance and continuous sign prediction performance of the model on the How2Sign dataset from the sign CTC branch directly. We observe that lipreading is a critical component of the network, delivering most of the performance improvement during fusion. The low continuous sign score can also be attributed to the misalignment between this branch and the English translation before the LLM reordering.

Varying LLMs: The next ablation was to compare the effectiveness of different LLMs, fully fine-tuned with different parameter sizes. Table 6 shows the results of this study.

Asynchronous Fusion: The penultimate ablation focuses on the effectiveness of the fusion model. The aim is to quantify how sensitive our fusion mechanism is to the temporal misalignment that naturally occurs between hand movements (manual signals) and mouth movements (non-manual cues) in real sign language. To quantify this we test four alignment settings:

1. No shift: Directly fuse frame $t$ of lips with frame $t$ of hands, with no temporal shift at all.

2. $\Delta\pm$ 5 frames: fuse frame $t$ of hands with lip $t\pm 3$ frames.

3. $\Delta\pm$ 10 frames: fuse frame $t$ of hands with lip $t\pm 5$ frames.

4. Learned Alignment: The model learns an optimal per-time fusion gating.
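The fixed-shift settings can be emulated by offsetting the lip stream before fusion. The sketch below assumes that frames shifted past the sequence boundary are filled by clamping to the first or last frame; this edge handling, like the function name, is an assumption.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def shift_lip_stream(lip: Sequence[T], delta: int) -> List[T]:
    """Return a sequence whose frame t holds lip[t + delta], edge-clamped."""
    n = len(lip)
    return [lip[min(n - 1, max(0, t + delta))] for t in range(n)]
```

Fusing the hand stream with `shift_lip_stream(lip, delta)` for a positive and a negative `delta` covers both directions of a given misalignment; `delta = 0` recovers the no-shift setting.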

6 Conclusion

We introduced SignBind-LLM, a modular framework that redefines gloss-free Sign Language Translation through explicit multi-stage fusion of continuous signing, fingerspelling, and lipreading. By decomposing SLT into dedicated expert streams and resolving their temporal asynchrony via a lightweight transformer, our model achieves state-of-the-art results on How2Sign (BLEU-4 of 22.1), BOBSL (BLEU-4 of 6.8) and ChicagoFSWildPlus (73.2% letter accuracy). Our findings validate that isolating and reconciling heterogeneous visual-linguistic cues before fusion leads to SOTA performance on sign language translation.

7 Acknowledgements

This work was supported by the SNSF project ‘SMILE II’ (CRSII5 193686), the Innosuisse IICT Flagship (PFFS-21 47), EPSRC grant APP24554 (SignGPT-EP/Z535370/1) and through funding from Google.org via the AI for Global Goals scheme. This work reflects only the authors’ views and the funders are not responsible for any use that may be made of the information it contains. Thank you to Oline Ranum for help with the Part-of-Speech analysis.

Appendix A Introduction

This supplementary material provides comprehensive technical details and additional ablation experiments for our proposed method.

The document is organized as follows:

• Appendix B – Extended Ablation Studies: Detailed comparisons of fusion architectures and zero-shot generalization experiments, quantifying the benefits of large-scale pre-training.

• Appendix C – Part-of-Speech Analysis: Fine-grained linguistic analysis across 16 POS categories, revealing our model’s strengths in content word prediction and the trade-off between visual fidelity and grammatical fluency.

• Appendix D – Implementation Details: Complete experimental setup including pseudo-glossing pipeline, phoneme extraction, model architecture specifications, training hyperparameters, and computational requirements.

• Appendix E – Qualitative Translation Analysis: Extensive translation examples from How2Sign and BOBSL, showing outputs from each expert stream and demonstrating how the Fusion Encoder resolves ambiguities.

Appendix B Extended Ablation Studies

In the main paper, we demonstrated that our Gated Fusion mechanism achieves state-of-the-art performance. Here, we analyze alternative fusion strategies and the model’s zero-shot generalization capabilities.

B.1 Analysis of Fusion Strategies

As discussed in the main paper, Sign Language translation faces a unique challenge: temporal asynchrony. The manual sign for a concept often occurs slightly before or after the corresponding mouthing, a phenomenon well-documented in sign language linguistics but rarely addressed in computational models. We hypothesized that a simple concatenation of features would fail to capture this dynamic relationship, and that explicit gating mechanisms would be necessary to learn when to rely on each modality.

To validate this hypothesis, we compared three distinct fusion strategies:

1. Concatenation + MLP: A naive baseline where visual features from all streams (manual + lip) are concatenated at each timestep and projected back to a common dimension via a two-layer MLP with GELU activation and dropout ( $p=0.1$ ). This MLP serves as a learned mixing function without any explicit attention or content-adaptive weighting. The resulting fused features are then fed directly into the Fusion Encoder.

2. Cross-Attention Fusion: A standard Transformer-based approach where the manual stream features act as queries and attend over the lipreading features as keys and values. The output of the cross-attention block is added to the manual representation via a residual connection and layer normalization. This allows full bidirectional interaction between modalities but comes at significant computational cost.

3. Gated Fusion (Ours): Our proposed mechanism that dynamically weighs the importance of the lipreading stream based on a learned gating function applied to the manual stream features. This lightweight approach (single linear layer + sigmoid) explicitly models the confidence of the manual predictor and adaptively suppresses or emphasizes lip information accordingly.

As shown in Table 8, the Concatenation baseline performs poorly (12.4 BLEU-4), representing a 9.7-point drop from our full model. We attribute this to “noise injection”: without a gating mechanism, the model cannot suppress the lipreading stream during periods of silence or irrelevant mouth movements (such as natural facial expressions unrelated to linguistic content), leading to hallucinations and semantic drift.

Cross-Attention improves substantially over concatenation and provides a negligible improvement over the gated fusion method, demonstrating that bidirectional interaction between modalities is beneficial. However, this approach introduces significant computational overhead. Cross-attention requires $O(T^{2})$ operations per layer, whereas our gating mechanism requires only $O(T)$ operations.

Our Gated Fusion achieves comparable performance (22.1 BLEU-4) by explicitly learning when to rely on lip patterns (e.g., during fingerspelling sequences or when manual signs are ambiguous) and when to ignore them (e.g., during non-linguistic facial expressions or signer speech).

B.2 Zero-Shot Generalization and Pre-training Effects

We further investigated the transferability of our learned representations by evaluating zero-shot generalization from the large-scale YouTube-ASL dataset to the smaller, controlled How2Sign dataset. As shown in Section C.1, when trained solely on YouTube-ASL (1,000 hours, diverse signers and conditions) and evaluated on How2Sign without any fine-tuning, the model achieves a BLEU-4 of $8.3$ . While substantially lower than the supervised baseline, this is a non-trivial result for a zero-shot gloss-free system. For comparison, in the original YouTube-ASL paper [61] the authors report a BLEU-4 score of just $3.95$ .

Notably, training only on How2Sign (without YouTube-ASL pre-training) yields 13.7 BLEU-4, which is significantly worse than our two-stage approach (22.1). This demonstrates that the “in-the-wild” diversity of YouTube-ASL teaches the model robust, signer-independent features for phonemes, handshapes, and their temporal relationships. The controlled How2Sign environment, while higher quality, lacks the variability necessary for the model to learn truly generalizable representations. This is still significantly better than the other approaches shown in Section C.1.

Appendix C Part-of-Speech (POS) Analysis

A common failure mode of Sign Language Translation methods is that the model predicts correct content words (nouns, verbs) but fails to construct a grammatically valid sentence with appropriate function words (prepositions, determiners, auxiliary verbs). This failure is particularly prevalent in gloss-based approaches, since intermediate gloss representations typically omit function words entirely. To evaluate whether our model exhibits this same behaviour, we conducted a comprehensive Part-of-Speech analysis.

C.1 Methodology

We ran Part-of-Speech tagging using the spaCy English language model (en_core_web_sm) on both the ground truth How2Sign references and our model’s generated translations. For each sentence, we extracted the distribution of POS tags and computed the accuracy for each tag compared with two SOTA approaches for SLT, Geo-Sign [14] and C2RL [9].
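A sketch of one way a per-tag score could be computed from already tagged sentences. The matching rule (a reference token counts as correct if the same lower-cased word with the same tag appears in the hypothesis, each hypothesis token being used at most once) is an assumption; the exact accuracy definition is not spelled out above.

```python
from collections import Counter
from typing import Dict, List, Tuple

Tagged = List[Tuple[str, str]]  # (word, POS tag) pairs

def per_pos_accuracy(ref_tagged: Tagged, hyp_tagged: Tagged) -> Dict[str, float]:
    """Fraction of reference tokens per tag that are matched in the hypothesis."""
    available = Counter((word.lower(), tag) for word, tag in hyp_tagged)
    total: Counter = Counter()
    hits: Counter = Counter()
    for word, tag in ref_tagged:
        total[tag] += 1
        key = (word.lower(), tag)
        if available[key] > 0:
            available[key] -= 1  # consume the matched hypothesis token
            hits[tag] += 1
    return {tag: hits[tag] / total[tag] for tag in total}
```

With spaCy, the tagged pairs for a sentence could be produced as `[(tok.text, tok.pos_) for tok in nlp(sentence)]`, where `nlp` is the loaded `en_core_web_sm` pipeline.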

Bibliography


[1] Samuel Albanie, Gül Varol, Liliane Momeni, Hannah Bull, Triantafyllos Afouras, Himel Chowdhury, Neil Fox, Bencie Woll, Rob Cooper, Andrew Mc Parland, and Andrew Zisserman. BOBSL: BBC-Oxford British Sign Language Dataset. 2021.
[2] Mario Aparicio, Philippe Peigneux, Brigitte Charlier, Danielle Balériaux, Martin Kavec, and Jacqueline Leybaert. The neural basis of speech perception through lipreading and manual cues: Evidence from deaf native users of cued speech. Neuropsychologia, 2017.
[3] Sobhan Asasi, Mohamed Ilyes Lakhal, and Richard Bowden. Hierarchical feature alignment for gloss-free sign language translation. In Adjunct Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents, 2025.
[4] B. Shi, A. Martinez Del Rio, J. Keane, D. Brentari, G. Shakhnarovich, and K. Livescu. Fingerspelling recognition in the wild with iterative visual attention. ICCV, 2019.
[5] British Sign Language. British sign language resources. https://www.british-sign.co.uk/, 2024.
[6] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. Neural sign language translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[7] Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard Bowden. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[8] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
