# A spatiotemporal transformer with cross-frame encoding and trajectory-aware decoding for multi-target fish tracking

**Authors:** Yang Li, Lei Han

PMC · DOI: 10.1038/s41598-025-31686-8 · Scientific Reports · 2025-12-10

## TL;DR

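This paper introduces a Transformer-based method for tracking multiple fish in challenging underwater environments. It reports gains of 0.022 to 0.031 in MOTA, IDF1, and Recall over GTR on the authors' own dataset, and of 0.022 to 0.030 over ByteTrack on UOT32.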

## Contribution

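A unified Transformer framework for multi-target fish tracking that combines cross-frame spatiotemporal encoding (temporal difference and frame position embeddings with residual motion enhancement) with trajectory-aware decoding (trajectory extrapolation priors and temporal association attention).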

## Key findings

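- On the authors' self-constructed underwater fish tracking dataset, the method scores 0.719 MOTA, 0.693 IDF1, and 0.742 Recall, ahead of the GTR baseline (0.688, 0.671, 0.720) by 0.031, 0.022, and 0.022 absolute points.
- On the UOT32 dataset, it scores 0.697 MOTA, 0.680 IDF1, and 0.730 Recall, ahead of ByteTrack (0.675, 0.650, 0.700) by 0.022, 0.030, and 0.030 absolute points.
- The authors report accurate detection and reliable identity association under occlusion and dense target conditions in complex underwater environments.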

## Abstract

In response to the challenges of multi-object fish tracking in complex underwater environments, where performance is easily affected by illumination changes, suspended particles, occlusion, and high inter-target visual similarity, this paper proposes a unified Transformer framework that integrates cross-frame spatiotemporal encoding with trajectory-aware decoding. In the encoding stage, temporal difference and frame position embeddings are introduced and combined with a residual motion enhancement mechanism to explicitly align appearance, scale, and displacement across frames. In the decoding stage, trajectory extrapolation priors and temporal association attention are employed to restrict cross-frame feature aggregation within reasonable candidate regions, achieving joint optimization of detection and association. On our self-constructed underwater fish tracking dataset, the proposed method achieves MOTA, IDF1, and Recall scores of 0.719, 0.693, and 0.742, improving over the strong baseline model GTR (0.688, 0.671, 0.720) by 0.031, 0.022, and 0.022 absolute points. On the UOT32 dataset, it attains 0.697, 0.680, and 0.730, surpassing ByteTrack (0.675, 0.650, 0.700) by 0.022, 0.030, and 0.030 absolute points, respectively. These results demonstrate that the proposed approach effectively integrates cross-frame spatiotemporal modeling with trajectory-guided decoding, enabling accurate detection and reliable identity association even under occlusion and dense target conditions. The method exhibits strong robustness and generalization in complex underwater environments, outperforming existing state-of-the-art approaches in both tracking accuracy and stability.
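
The decoding idea in the abstract, restricting cross-frame aggregation to candidate regions suggested by a trajectory prior, can be illustrated with a minimal sketch. This is not the authors' implementation: the constant-velocity motion model, the circular gate, the `radius` parameter, and the function names `extrapolate` and `gated_attention` are all assumptions made for illustration.

```python
import math


def extrapolate(prev_center, curr_center):
    """Constant-velocity guess for the next center (an assumed motion model)."""
    (px, py), (cx, cy) = prev_center, curr_center
    return (2 * cx - px, 2 * cy - py)


def gated_attention(scores, candidate_centers, predicted_center, radius):
    """Softmax over association scores, restricted to candidates near the prior.

    Candidates farther than `radius` from the predicted center get zero weight.
    Returns all zeros if no candidate falls inside the gate.
    """
    px, py = predicted_center
    inside = [math.hypot(x - px, y - py) <= radius for x, y in candidate_centers]
    if not any(inside):
        return [0.0] * len(scores)
    # Subtract the largest in-gate score for numerical stability.
    peak = max(s for s, keep in zip(scores, inside) if keep)
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, inside)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical track: the fish moved from (10, 10) to (14, 12).
predicted = extrapolate((10.0, 10.0), (14.0, 12.0))

# Hypothetical detections in the next frame with raw similarity scores.
candidates = [(19.0, 14.5), (60.0, 40.0), (17.0, 13.0)]
weights = gated_attention([1.2, 3.0, 0.4], candidates, predicted, radius=5.0)
```

In this toy example the second candidate has the highest raw score but lies far from the extrapolated position, so it receives zero weight. That is the intuition behind using a motion prior to avoid identity switches between visually similar fish.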

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12808727/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12808727/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/PMC12808727/full.md

---
Source: https://tomesphere.com/paper/PMC12808727