# SignBind-LLM: Multi-Stage Modality Fusion for Sign Language Translation

**Authors:** Marshall Thomas, Edward Fish, Richard Bowden

arXiv: 2509.00030 · 2025-12-05

## TL;DR

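SignBind-LLM is a modular, multi-stage approach to sign language translation. Specialized predictors decode continuous signing, fingerspelling, and lipreading separately; a lightweight transformer fuses the resulting token streams and resolves their temporal misalignment; an LLM then generates the final sentence. The method reports new state-of-the-art results on How2Sign, ChicagoFSWildPlus, and BOBSL.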

## Contribution

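The paper presents a modular framework in which separate expert networks decode continuous signing, fingerspelling, and lipreading into token sequences. A lightweight transformer fuses these parallel streams before a Large Language Model generates the translation, which the authors report improves translation of names, places, and technical terms.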

## Key findings

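- Reports new state-of-the-art BLEU-4 scores of 22.1 on How2Sign and 6.8 on BOBSL.
- Reports 73.2% letter accuracy on ChicagoFSWildPlus.
- The authors take these results as support for their core hypothesis: isolating and solving distinct recognition tasks (continuous signing, fingerspelling, lipreading) before fusion is more effective than forcing a single network to learn them simultaneously.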
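For readers unfamiliar with the fingerspelling metric, the following is a minimal sketch of letter accuracy as it is commonly defined in fingerspelling recognition work: one minus the character-level edit distance divided by the reference length. This summary does not state the paper's exact definition, so treat the formula and the function names `edit_distance` and `letter_accuracy` as assumptions for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two strings."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference letters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def letter_accuracy(ref: str, hyp: str) -> float:
    """Assumed definition: 1 - edit_distance / len(ref). Can be negative for long hypotheses."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)
```

Under this definition, a hypothesis `"bowdon"` for the reference `"bowden"` has one substitution over six letters, giving a letter accuracy of 5/6.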

## Abstract

Despite progress in gloss-free Sign Language Translation (SLT), traditional single-modality end-to-end approaches consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in SLT with Large Language Models has sidestepped this challenge, forcing a single network to learn these simultaneously, resulting in poor performance when tasked with translating crucial information such as names, places, and technical terms. We introduce SignBind-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state-of-the-art on the How2Sign, ChicagoFSWildPlus, and BOBSL datasets, with a BLEU-4 score of 22.1, a letter accuracy of 73.2%, and a BLEU-4 score of 6.8, respectively. These results validate our core hypothesis: isolating and solving distinct recognition tasks before fusion provides a more powerful and effective pathway to robust, high-fidelity sign language translation.
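The data flow described in the abstract (per-modality token streams, then fusion, then an LLM) can be illustrated with a toy sketch. This is not the authors' method: the paper fuses streams with a learned lightweight transformer, whereas the stand-in below simply interleaves tokens by timestamp and tags each with its modality. The `TimedToken` type, the timestamps, the modality tags, and the prompt format are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TimedToken:
    text: str       # token emitted by one expert predictor
    start: float    # assumed start time in seconds (hypothetical field)
    modality: str   # "sign", "fingerspell", or "lip" (hypothetical labels)


def merge_streams(*streams: List[TimedToken]) -> List[TimedToken]:
    """Naive stand-in for learned fusion: interleave all tokens by start time."""
    merged = [tok for stream in streams for tok in stream]
    return sorted(merged, key=lambda t: (t.start, t.modality))


def to_prompt(tokens: List[TimedToken]) -> str:
    """Serialize the merged stream into a tagged string an LLM could consume."""
    return " ".join(f"<{t.modality}>{t.text}" for t in tokens)


# Three expert outputs for the same clip; the lip cue lags the fingerspelling.
sign = [TimedToken("MY", 0.0, "sign"), TimedToken("NAME", 0.4, "sign")]
fingerspell = [TimedToken("R-I-C-H-A-R-D", 0.9, "fingerspell")]
lip = [TimedToken("richard", 1.0, "lip")]

prompt = to_prompt(merge_streams(sign, fingerspell, lip))
# prompt == "<sign>MY <sign>NAME <fingerspell>R-I-C-H-A-R-D <lip>richard"
```

A fixed timestamp sort like this cannot resolve the asynchrony between manual and non-manual cues that the abstract highlights, which is presumably why the authors use a learned fusion module instead.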

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00030/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00030/full.md

## References

71 references — full list in the complete paper: https://tomesphere.com/paper/2509.00030/full.md

---
Source: https://tomesphere.com/paper/2509.00030