Enhanced CenterTrack for Robust Underwater Multi-Fish Tracking

Jinfeng Wang; Mingrun Lin; Zhipeng Cheng; Renyou Yang; Qiong Huang

PMC · DOI:10.3390/ani16020156·January 6, 2026

Enhanced CenterTrack for Robust Underwater Multi-Fish Tracking

Jinfeng Wang, Mingrun Lin, Zhipeng Cheng, Renyou Yang, Qiong Huang

PDF

Open Access

TL;DR

This paper introduces an improved tracking system for monitoring multiple fish in underwater environments, achieving high accuracy and reliability.

Contribution

The study proposes a CenterTrack-based framework with three complementary components for robust underwater multi-fish tracking.

Findings

01

The proposed method achieves an IDF1 score of 82.5% on the MF25 dataset.

02

The framework improves trajectory continuity under occlusions and abrupt motion.

03

The system is computationally efficient and suitable for real-time tracking in aquaculture.

Abstract

Tracking multiple fish in underwater environments is essential for studying fish behavior and supporting ecological monitoring. However, poor visibility, complex backgrounds, and frequent occlusions make reliable tracking difficult in real underwater scenes. In this study, we propose an improved method for underwater multi-fish tracking based on CenterTrack. The proposed approach improves tracking stability and accuracy under challenging underwater conditions. Experimental results on underwater fish datasets show that our method performs more reliably than existing approaches. This study provides a practical tool for automated analysis of fish behavior in natural underwater environments. Accurate monitoring of fish movement is essential for understanding behavioral patterns and group dynamics in aquaculture systems. Underwater scenes—characterized by dense populations, frequent…

Figures12

Click any figure to enlarge with its caption.

Funding1

—Special Fund for Marine Economic Development (six marine industries) of Guangdong Province, China

Keywords

multi-object fish trackingunderwater aquaculturemulti-branch feature extractionocclusion handlingMF25 dataset

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsWater Quality Monitoring Technologies · Video Surveillance and Tracking Methods · Zebrafish Biomedical Research Applications

Full text

1. Introduction

With the continuously growing global demand for aquaculture products, intelligent aquaculture has emerged as a key approach to achieving sustainable and efficient fishery production. In intensive aquaculture systems, large numbers of fish are cultivated in enclosed or semi-enclosed environments, making automated and non-invasive monitoring of fish movement, behavior, and group dynamics increasingly important [1,2,3]. Behavioral indicators such as swimming patterns, social interactions, and responses to environmental stimuli provide valuable information for welfare assessment, feeding optimization, and behavioral research [4,5,6].

Video-based monitoring using computer vision has shown considerable potential for extracting fine-grained movement trajectories and enabling quantitative analysis of fish behavior [7,8,9]. Accurate and continuous multi-fish tracking therefore serves as a fundamental technical component for a wide range of downstream aquaculture studies, including behavioral analysis and population-level monitoring. Although this work does not directly model biological or physiological indicators, reliable trajectory extraction is a necessary prerequisite for such higher-level analyses in both aquaculture engineering and fish behavior research.

Despite recent progress, underwater multi-fish tracking remains a challenging task due to complex imaging conditions, including uneven illumination, light refraction, water turbidity, and suspended particles [10]. In addition, fish exhibit non-rigid body deformation, rapid and abrupt motion, and highly similar visual appearances. High-density farming environments further exacerbate these challenges by introducing frequent occlusions and frequent target interactions, making robust identity association and long-term trajectory continuity particularly difficult [11,12,13].

Existing studies have explored a variety of vision-based approaches for fish monitoring and tracking. These include methods for estimating ventilation rates [14], swimming posture, and tail-beat frequency [15], as well as object detection and tracking pipelines based on You Only Look Once (YOLO), Deep Simple Online and Realtime Tracking (DeepSORT), and optical flow for fish counting and population management [16]. While these methods demonstrate promising performance in controlled or low-density environments, they often rely on heuristic association strategies and exhibit frequent identity switches and fragmented trajectories under dense occlusions and fast motion, limiting their suitability for continuous and fine-grained behavioral analysis [17]. Although these recent approaches provide valuable benchmarks, they are mostly limited to low-density or controlled conditions and do not fully address the challenges of dense aquaculture environments with frequent occlusions, rapid non-rigid motion, and visually similar individuals.

More recently, general-purpose multi-object tracking (MOT) frameworks, including joint detection-and-tracking methods such as CenterTrack [18], tracking-by-detection approaches such as OC_SORT, and transformer-based trackers such as TransCenter, TPTrack, and MOTR, have achieved strong performance on terrestrial benchmarks. Several recent studies have attempted to apply or adapt these frameworks to aquatic or underwater scenarios. However, most existing MOT methods are primarily designed for rigid objects and relatively stable imaging conditions, and their direct application to underwater multi-fish scenarios remains suboptimal. Frequent occlusions, appearance ambiguity, and abrupt motion changes commonly observed in aquaculture environments challenge the robustness of existing motion modeling and data association strategies. Although sonar-based or multimodal systems can partially alleviate visual limitations, their lower spatial resolution, sensitivity to noise, and higher computational cost restrict their applicability in high-precision behavioral studies [19,20].

To address these challenges, we propose an improved CenterTrack-based framework for multi-fish tracking in dense aquaculture environments. The proposed method aims to produce stable and high-quality trajectories under realistic underwater conditions, providing a reliable technical foundation for downstream behavioral analysis. In addition, we introduce MF25 dataset, which consists of 75 individually annotated fish recorded under diverse lighting conditions, population densities, and occlusion scenarios, offering a realistic benchmark for evaluating multi-fish tracking performance in aquaculture settings.

To overcome the above limitations and improve the robustness of multi-fish tracking in realistic aquaculture environments, this work makes the following contributions:

(1)A multi-fish tracking framework based on CenterTrack is proposed for reconstructing and refining historical features, which can better adapt to dense occlusions and rapid, non-rigid fish motion and improve trajectory stability and reducing identity switches.
(2)Adaptive motion–appearance fusion and occlusion-aware trajectory recovery strategies are introduced to handle unreliable associations caused by fast motion, visual ambiguity, and prolonged occlusions. which can improve tracking robustness in realistic aquaculture environment.
(3)A self-built MF25 multi-fish tracking dataset is constructed, containing annotated video sequences of 75 fish captured in a unified underwater scene. It features diverse motion patterns and complex group interactions, providing frame-level detections and trajectory-level ground truth for systematic evaluation.
(4)Extensive experiments and ablation studies show that the proposed method achieves robust tracking performance, producing trajectories with improved continuity and identity consistency, suitable for downstream behavioral and group-dynamics analysis.

The remainder of this article is organized as follows: Section 2 describes the proposed methodology. Section 3 introduces the MF25 dataset and experimental setup and presents the experimental results and comparisons with baseline methods. Section 4 discusses limitations and potential applications. Section 5 concludes this study.

2. Materials and Methods

2.1. Framework

In practical industrialized fish farming environments, underwater video data often suffer from prolonged occlusions, dense fish aggregations, and severe morphological deformations due to rapid movement. These challenges significantly impair tracking accuracy, falling short of operational requirements. Additionally, farm-raised fish of the same species often exhibit highly similar appearances and body structures, making individual identification and multi-object tracking particularly difficult.

To address the above challenges, we propose an enhanced CenterTrack-based multi-fish tracking framework, whose system architecture is illustrated in Figure 1. The model takes three inputs: the current frame image $[eqn]$ , the previous frame image $[eqn]$ , and the previous frame’s heatmap features $[eqn]$ . A novel Multi-scale Attention and Dynamic Weighting (MADW) module is introduced to process features extracted from $[eqn]$ and $[eqn]$ , improving robustness against prolonged occlusions and severe non-rigid deformations. The refined historical features are then fused with features from $[eqn]$ to generate enriched spatial representations.

These fused representations are fed into a Deep Layer Aggregation (DLA-34) backbone [21] for multi-scale feature extraction, as shown in Figure 2. In addition to the original output heads, we introduce an Occlusion-Aware Head (OAHead), which enhances edge-aware feature representations and promotes identity consistency for visually similar fish under heavy occlusion. During training, Focal Loss [22] and Dice Loss [23] are jointly employed with an adaptive weighting mechanism to balance positive and negative samples while improving boundary continuity. Finally, the processed features are passed to a Cascade Correspondence Association (CCA) algorithm, which performs a three-stage matching strategy to handle detections with varying confidence levels, enabling efficient trajectory update or termination for real-time online tracking.

2.2. Multi-Scale Attention and Dynamic Weighting

The MADW module refines input features using two complementary layers: Multi-scale Attention and Dynamic Weighting, shown in Figure 3.

The Multi-scale Attention layer adaptively fuses spatial and channel-wise features across multiple scales, improving both local and global contextual dependencies. Channel-grouped features undergo 1 × 1 and 3 × 3 convolutions to capture fine-grained and contextual information, respectively. Horizontal pooling reduces spatial dimensionality while preserving structural cues. The pooled features are concatenated and further fused using a 1 × 1 convolution, followed by Sigmoid activation to emphasize critical channels and refine local features. This design significantly enhances robustness against rapid movement and severe occlusions by exploiting multi-scale information from preceding frames.

The Dynamic Weighting layer adjusts the importance of different feature branches in real time. It employs three parallel convolutional branches (1 × 1, 3 × 3, and 5 × 5) to extract features with varying receptive fields, along with global average pooling for dimensionality reduction. These multi-scale features are concatenated and fused via a 1 × 1 convolution to generate enhanced representations. This mechanism adaptively adjusts the fusion weights between historical heatmap features and current frame features, effectively accommodating scale and positional variations while suppressing redundant information and emphasizing key regions. Together, these two layers improve the model’s perception in challenging underwater tracking scenarios.

2.3. Occlusion-Aware Head

The OAHead processes DLA-34 backbone outputs through two branches aimed at occlusion prediction and feature rectification. Each branch uses parallel 3 × 3 and 1 × 1 convolutions; the occlusion branch further applies 1 × 1 convolution for dimensionality reduction, while the rectification branch employs Instance Normalization to retain instance-level details.

To capture global spatial features without positional loss, horizontal and vertical pooling are applied, followed by concatenation and 1 × 1 convolution for efficient dimension reduction. Separable 1 × 1 convolutions learn independent attention weights along horizontal and vertical directions. Both branches use Sigmoid activations for soft attention and GeLU for nonlinear enhancement, extracting features that are robust to occlusion while suppressing irrelevant background noise, as shown in Figure 4.

To address class imbalance, the module employs a standard Focal Loss:

[eqn]

where $[eqn]$ is the predicted occlusion probability, $[eqn]$ the ground-truth label, and $[eqn]$ the hard-sample modulation factor. Dice Loss is added to maximize the IOU between predicted and true masks, mitigating pixel-level imbalance:

[eqn]

An adaptive weighting mechanism dynamically adjusts $[eqn]$ (the Focal Loss weight) and $[eqn]$ (the Dice Loss weight) based on gradient directions, ensuring a balanced contribution from each term. The total loss is thus:

[eqn]

where $[eqn]$ is the initial occlusion-loss weight. This multi-loss synergy enables the model to handle severe occlusions (via Focal Loss) while preserving precise boundary delineation (via Dice Loss), resulting in robust occlusion-aware feature learning.

[eqn]

2.4. Cascade Correspondence Association

The proposed CCA fuses Kalman filtering with a three-stage cascade matching pipeline to robustly associate predicted tracks with incoming detections. The detailed execution flow of the proposed method is formally presented in Algorithm 1.

As illustrated in the algorithm, the procedure accepts the current frame’s detections as input and initializes the matching process. The core association mechanism (Lines 3–5) iterates through a defined sequence of three affinity metrics: Motion consistency, IoU overlap, and Euclidean Distance. In the first stage, the algorithm computes the Mahalanobis distance between Kalman-predicted states and detection measurements to handle motion consistency. Unmatched pairs then proceed to the IoU matching stage, which accommodates non-rigid deformations typical of swimming fish. Finally, any remaining ambiguous tracks undergo distance matching based on centroid proximity. This function Match extracts unmatched pairs, computes the specific affinity matrix for the current stage, and solves the assignment problem using the Hungarian algorithm. Algorithm 1 Triple-Stage Cascade Matching

1:procedure Step( $[eqn]$ )
2: $[eqn]$ , $[eqn]$
3: for stage $[eqn]$ {Motion, IoU, Distance} do
4: $[eqn]$ Match $[eqn]$
5: Update $[eqn]$ with $[eqn]$
6: end for
7: for unmatched $[eqn]$ do
8: if $[eqn]$ then
9: Initialize new track, add to $[eqn]$
10: end if
11: end for
12: for unmatched $[eqn]$ do
13: if $[eqn]$ then
14: Propagate with motion model, decay score, add to $[eqn]$
15: end if
16: end for
17: return $[eqn]$
18:end procedure
19:function Match(stage, detections, tracks, matched)
20: Extract unmatched pairs
21: Compute affinity (motion/IoU/distance)
22: Solve with Hungarian algorithm
23: return valid matches
24:end function

Following the cascade matching, the algorithm manages the lifecycle of the trajectories. As detailed in Lines 6–10, detections that remain unmatched after all three stages are treated as potential new targets; if their confidence score exceeds $[eqn]$ , a new track is initialized. Conversely, Lines 11–15 handle existing tracks that failed to match any detection. These tracks are propagated using the motion model to predict their position in the next frame. However, their confidence scores are decayed, and if a track remains unmatched for a period exceeding $[eqn]$ , it is considered lost and purged from the memory to prevent false-positive drift.

By cascading these three complementary criteria as structured in Algorithm 1, CCA strikes a balance between accuracy and efficiency, delivering robust, real-time multi-object tracking performance as shown in Figure 5.

2.5. Evaluation Metrics

To comprehensively evaluate the performance of multi-fish tracking, we adopt a set of standard multi-object tracking metrics. These metrics cover identity-level accuracy, detection precision, and trajectory completeness, strictly adhering to the standard definitions used in state-of-the-art literature [24,25].

We employ Identification F1-score (IDF1) and Multiple Object Tracking Accuracy (MOTA) as the primary evaluation metrics.

IDF1 measures the consistency between predicted and ground-truth identities. It is essentially the F1-score calculated based on identity-level matching and is defined as follows:

[eqn]

where $[eqn]$ (Identification True Positives), $[eqn]$ (Identification False Positives), and $[eqn]$ (Identification False Negatives) denote the global counts of correctly identified detections, false identifications, and missed identifications, respectively.

MOTA provides a holistic measure of tracking accuracy by jointly accounting for detection errors and identity switches:

[eqn]

where $[eqn]$ (False Negatives), $[eqn]$ (False Positives), and $[eqn]$ (Identity Switches) represent the total counts of errors summed across all frames, and $[eqn]$ is the total number of ground-truth objects.

For a more detailed evaluation of identity consistency, we report Identification Precision (IDP) and Identification Recall (IDR) independently:

[eqn]

[eqn]

To evaluate trajectory continuity, we adopt Mostly Tracked (MT) and Mostly Lost (ML) with separate definitions:

[eqn]

[eqn]

where $[eqn]$ is the total number of ground-truth trajectories, $[eqn]$ and $[eqn]$ denote the number of trajectories tracked for at least $[eqn]$ and less than $[eqn]$ of their lifespan, respectively.

Finally, Higher Order Tracking Accuracy (HOTA) [25] provides a unified evaluation of detection and association:

[eqn]

where Detection Accuracy (DetA) and Association Accuracy (AssA) are defined using global tracking components:

[eqn]

[eqn]

Here, $[eqn]$ denote the global counts of detection-level true positives, false positives, and false negatives (where $[eqn]$ ). For association, $[eqn]$ represents the association score for each matched detection c.

2.6. Training and Inference

The proposed framework has two distinct phases: offline training and online inference.

All video frames used for training were collected under consistent lighting and water turbidity conditions, ensuring uniform image quality and avoiding potential biases due to environmental variations.

During training, the network learns robust features for fish detection, occlusion handling, and trajectory association. Annotated video data are used to update all model parameters, minimizing Focal Loss, Dice Loss, and adaptive weighting mechanisms. This process is performed offline, usually on high-end GPUs.

During inference, the trained model is applied to new video frames in real-time. No parameters are updated; the model performs forward passes only. Modules such as MADW, OAHead, and CCA continue to refine features, enhance occlusion robustness, and associate tracks across frames. In this phase, the focus is on fast and accurate processing of incoming video data.

Although actual performance depends on hardware and input resolution, the model can typically achieve 30 FPS on Powerful GPUs such as the NVIDIA RTX 4090, with memory usage around 6–8 GB per frame. These numbers suggest that the system is suitable for industrial deployment in aquaculture, balancing both accuracy and computational efficiency.

3. Experiments

3.1. DataSet

MF25 is a multi-fish tracking dataset consisting of short video clips, each containing 25 consecutive frames, recorded in aquaculture tanks with up to 75 simultaneously visible fish. The dataset was collected during December 2024 at South China Agricultural University in Guangzhou, China. The experimental configuration comprised a 2.5 m (Length) × 1.5 m (Width) × 0.9 m (Height) aquaculture tank stocked with 75 grass carp (Ctenopharyngodon idellus) ranging from 8 to 25 cm in total length (including caudal fin), simulating industrial aquaculture conditions. All experiments follow common aquaculture practice, where each tank contains only a single fish species, and tracking is performed within the same enclosure. The tank incorporated an aerator and environmental monitoring systems to maintain optimal aquatic conditions and track growth parameters, as shown in Figure 6. A Shanghai Huachuan Technologies ZED 2i IP66-rated stereo camera (Shanghai, China)—engineered for harsh environment operation and suitable for agricultural, industrial, and pharmaceutical applications—was deployed. Specifically, the camera was explicitly configured to record at a resolution of 1920 × 1080 pixels and a frame rate of 30 frames per second (FPS) to ensure high-definition data acquisition, as shown in Figure 7 and Table 1.

The recorded footage exhibits frequent severe occlusion instances and morphological deformations, with subjects occasionally exiting the frame due to camera perspective constraints. Schooling behavior predominates in most sequences, inducing rapid locomotion, while residual footage contains slow-moving specimens susceptible to persistent occlusion. We segmented 10–15 s clips into 1917 training frames and 350 test frames. Each frame was annotated by a single expert; inter-annotator agreement was not applicable. All sequences were annotated following the Multiple Object Tracking Challenge (MOTChallenge) format specifications.

Experimental trials used a deep learning server equipped with an NVIDIA RTX 4090 graphics processing unit (GPU), an Intel Xeon Gold 6338 central processing unit (CPU), and 128 GB random access memory (RAM). The framework was implemented in PyTorch 1.13.1+cuda 11.7 with the following parameters: 512 × 512 input resolution, batch size 8, learning rate 1.25 × 10^−4^, dropout rate 0.1, and 70 training epochs. During tracking, detection thresholds were set to track_thresh = 0.4 and pre_thresh = 0.4, with trajectories failing to match for over 3 consecutive frames were discarded.

3.2. Comparative Study

To evaluate the performance of ours method in multi-fish underwater tracking, we conducted comprehensive comparisons with two tracking paradigms:joint detection and embedding (JDE) and separate detection and embedding (SDE) and other relevant models. All experiments were performed on the underwater MF25 dataset to validate our model’s efficacy in complex multi-fish tracking scenarios.

To demonstrate the performance of ours method for complex underwater multi-fish tracking, we compared it against classic algorithms including SORT [26], DeepSORT [27,28], ByteTrack [29], OCSORT [30], CenterTrack, MOTR [31], FairMOT [32], TransCenter [33], TransTrack [34], TPTrack [35] and Ours Method using the specified metrics. All results were obtained by training and tracking on the MF25 dataset to evaluate detection accuracy and tracking capability for the same targets.shown in Figure 8.

Given the frequent target loss and identity switches characteristic of fish objects, the IDF1 metric was employed as a critical measure of identity-preserving tracking performance, while the MOTA metric was used to assess overall multi-object tracking accuracy. As shown in Table 2, the SORT method is suitable for real-time systems, while DeepSORT improves upon SORT by enhancing occlusion handling. ByteTrack combines the strengths of SORT and DeepSORT and demonstrates particular effectiveness in extremely crowded scenarios. OCSORT excels in high-speed motion scenes and shows advantages for tracking fast-moving underwater fish. FairMOT offers robust recognition in dense scenes, effectively addressing underwater crowding challenges compared to SORT and DeepSORT approaches. TransTrack provides a lightweight solution that achieves SORT-like speed while maintaining good accuracy. TransCenter achieves significant improvements in long-range temporal association, substantially boosting performance for underwater multi-fish tracking. Unlike TransCenter, MOTR offers stronger handling of fish deformation and occlusion. TPTrack represents the performance leader among non-Transformer models, significantly reducing computational overhead and resource consumption.

Our proposed method significantly enhances the accuracy and robustness of inter-frame association in multi-fish tracking tasks. Specifically, by introducing the MADW module, the model effectively strengthens feature representation during inter-frame feature extraction, thereby increasing the information density and discriminative capability of extracted features. This improvement provides a more reliable foundation for subsequent matching and tracking processes.

In addition, the designed OAHead module and the improved loss function effectively reduce misidentification and identity confusion among different fish under severe occlusion and high-density crowding scenarios. These targeted enhancements substantially improve the model’s robustness and generalization ability in complex underwater environments.

Moreover, the proposed CCA algorithm further improves identity preservation during the tracking process. It enables the tracker to avoid identity switches and target losses even when fish interact frequently, undergo large pose changes, or experience partial occlusions, ensuring more consistent and continuous trajectories.

The comprehensive experimental results demonstrate that our approach demonstrates competitive or superior performance across multiple metrics compared with Transformer-based methods such as TransCenter and TPTrack in multi-fish tracking scenarios. Compared with the computationally intensive TransTrack, our method achieves a favorable balance between tracking accuracy and model complexity.

Among all evaluated metrics, our improved method achieves the highest scores in both MOTA and IDF1, clearly validating its superior performance and strong robustness in complex multi-fish tracking environments. These results indicate that our framework not only surpasses existing approaches in accuracy but also exhibits remarkable advantages in model efficiency and practical applicability.

To further demonstrate robustness, the publicly available Uniform and Deformable Multi-fish Tracking (UD-MFT) dataset [36], which contains both underwater and above-water fish tracking sequences. To comprehensively evaluate the tracking performance, we compared the proposed method with a series of representative multi-object tracking algorithms, including SORT, DeepSORT, ByteTrack, FairMOT, CenterTrack, TransTrack, GTR [37], OCSORT, and TPTrack. As shown in Table 3, OCSORT achieved the highest IDF1 and IDR scores, while TPTrack obtained the best MOTP. Our improved CenterTrack framework (“Ours”) achieved the highest IDP and competitive overall performance, demonstrating the effectiveness of the proposed enhancements in multi-fish tracking. Our method achieves superior performance in both DetA and MOTA metrics. This demonstrates our model’s excellence in both single-frame detection capability and cross-sequence association ability, establishing it as a well-balanced and powerful approach that performs exceptionally in both core aspects of “detection” and “association”.

For the MF25 dataset, the model was trained for 70 epochs using the training split, while performance was monitored on a held-out validation set. Standard data augmentation techniques were applied during training, including random horizontal flipping, scale jittering, and color perturbation, to improve robustness to underwater appearance variations. Early stopping was not adopted; instead, the model checkpoint achieving the best validation performance was selected for final evaluation on the test set.

For the UDMFT dataset, the same training and validation protocol was followed to ensure consistency. The model was fine-tuned for 70 epochs using the official training split, with identical data augmentation settings. Model selection was based on validation performance, and all evaluations were conducted under the same evaluation metrics and protocols as those used for MF25.

The original resolution for all models was configured according to their backbone network requirements, ensuring that no model’s performance was compromised by resolution issues. In Table 4, we compared different backbone networks, providing detailed analysis of DLA, MobileNetV2 [38] and ResNet [39] across various dimensions. The results clearly indicate that DLA34 delivers the best performance among all tested backbone networks. All other parameters maintained their original default configurations.

3.3. Ablation Study and Analysis

Ablation studies were conducted on our enhanced CenterTrack framework in Table 5. Integration of the proposed MADW module into the three-branch input elevated performance to 78.9% IDF1, 83.9% MOTA, 81.6% IDP, 76.3% IDR, and 20.2% MOTP, confirming its efficacy in feature extraction enhancement. Subsequent incorporation of the OAHead module improved occlusion handling and edge discrimination through its specialized loss function, which optimizes positive-negative sample balance in occluded regions, enhances prediction-ground truth IOU alignment, and dynamically regulates loss contributions without manual intervention. This configuration achieved 79.0% IDF1, 81.7% MOTA, 81.8% IDP, 76.3% IDR, and 20.3% MOTP. shown in Figure 9.

Combined implementation of MADW and OAHead modules yielded 79.8% IDF1, 85.0% MOTA, 81.9% IDP, 77.8% IDR, and 20.0% MOTP. Final integration of our CCA method employing three-stage cascade matching (motion correlation, IOU similarity, distance association) to reconcile predicted and observed trajectories—enabled recovery of occluded and deformed targets, achieving peak performance of 82.5% IDF1, 85.8% MOTA, 84.7% IDP, 80.4% IDR, and 20.0% MOTP. These experiments collectively validate each module’s contribution to enhanced detection, tracking, and association capabilities.

3.4. Hyperparameter Optimization

We conducted hyperparameter experiments based on the improved CenterTrack framework. This study compared five key parameters: Track_Thresh, Pre_Thresh, BatchSize, Learning_Rate, and Epoch.

The results in Figure 10 demonstrate that setting both Track_Thresh and Pre_Thresh to 0.4 yields optimal network performance. At these values, the model achieves its highest scores: 82.5% IDF1, 85.8% MOTA, 84.7% IDP, 80.4% IDR, and 20% MOTP. Alternative parameter configurations resulted in varying degrees of performance degradation across these metrics.

For BatchSize, we evaluated settings of 2, 4, 8, 16, and 32. Performance metrics decreased slightly at BatchSize = 2, while values at 4, 16, and 32 remained comparable. This indicates that increasing BatchSize beyond 16 provides negligible performance gains. Given that BatchSize = 32 requires dual RTX 4090 GPUs for execution, we selected BatchSize = 8 as our final parameter to balance computational efficiency with hardware compatibility while maintaining peak performance.

Regarding the learning rate, experiments were conducted with values of 1 × 10^−4^, 1.25 × 10^−4^, 1.5 × 10^−4^, 1.75 × 10^−4^, and 2 × 10^−4^. The best performance was achieved at a learning rate of 1.25 × 10^−4^, which is likely due to faster and more stable convergence during training.

Epochs were evaluated at 60, 70, 80, 90, and 100. As shown in Figure 10, tracking performance improves steadily from 60 to 70 epochs, indicating progressive model convergence. The best performance is achieved at 70 epochs. When training is extended beyond 70 epochs, both IDF1 and MOTA exhibit a gradual decline, suggesting overfitting to the training data. Therefore, 70 epochs represent an optimal trade-off between sufficient convergence and generalization performance, and all experiments in this study were conducted using 70 training epochs.

Figure 11 shows the training loss curves of the proposed method over 70 epochs, including the total loss and individual loss components, which consistently decrease and converge without significant oscillation, indicating stable and balanced optimization. The loss curves become relatively stable after around 50 epochs, suggesting that training for 70 epochs is sufficient to ensure convergence while avoiding unnecessary computational cost.

For a fair comparison, all baseline tracking methods were retrained on the MF25 training set using the same data split as the proposed method. Pre-trained weights provided by the original authors were used only for initialization, and all models were fine-tuned on MF25 to adapt to the underwater aquaculture scenarios.

Hyperparameters for each baseline were selected following the recommendations in their original papers and adjusted within a limited range to achieve stable convergence on MF25. No dataset-specific tuning beyond standard practices was performed, and all methods were evaluated under identical training and evaluation protocols.

4. Discussion

4.1. Key Features and Contributions

The proposed framework supports efficient multi-fish tracking while also estimating activity levels and group dynamics. By quantifying trajectory length and aggregate activity, the approach enables more precise assessment of fish vitality and adaptability to environmental changes. The method also mitigates the impact of water flow and other environmental disturbances on activity measurements, increasing the likelihood that observed motion reflects true fish behavior rather than imaging artifacts.

Ablation studies and comparative experiments demonstrated that the ours model maintains high tracking accuracy while preserving computational efficiency suitable for online tracking. This ensures that subsequent analyses, such as activity evaluation and trajectory assessment, are based on reliable and continuous tracking data. Visualization of group trajectories further confirms that the method captures group-level behavior accurately, supporting applications in behavioral analysis and aquaculture management.

4.2. Correlation Between Tracking Accuracy and Activity Estimation

Our analysis revealed that errors in activity estimation mainly arise from identity switches or missed tracking events. shown in Figure 12, the mean absolute percentage error (MAPE) per frame for both trajectory length and activity levels is inversely correlated with tracking metrics such as IDF1 and MOTA. This indicates that the reliability of activity measurements depends critically on robust identity preservation and continuous trajectory tracking.

4.3. Limitations and Future Directions

Although the proposed framework demonstrates strong tracking performance in controlled aquaculture environments, its scope and applicability should be interpreted with appropriate caution.

(1)Environmental diversity and domain shift: Experiments were conducted in relatively controlled tank environments that simulate industrial aquaculture. Although these settings capture realistic occlusion and crowding patterns, environmental diversity remains limited. Variations in illumination, water turbidity, camera viewpoints, and background complexity in real production facilities may introduce domain shifts, potentially degrading tracking performance when the model is directly applied without further adaptation.(2)Identity persistence under long-term occlusion: The proposed tracker maintains short-term visual identities rather than persistent biological identities. When fish leave the field of view or experience prolonged occlusions, identity consistency may be compromised. This limitation is particularly relevant for long-term behavioral or individual-based studies, where identity switches may affect the interpretation of fine-grained behavioral patterns.(3)Scope of behavioral analysis: The current study primarily focuses on trajectory-based indicators such as activity level and movement length. While these metrics are informative, they do not fully capture complex group dynamics or physiological states. Future work should investigate richer behavioral descriptors and their relationships with fish health, stress, and environmental variables.(4)Computational constraints and deployment: Although a systematic analysis of computational efficiency across different hardware platforms was not conducted, the proposed framework achieves real-time tracking under the current experimental configuration with 30 FPS input video on a single RTX 4090 GPU. The introduced multi-stage association and motion modeling components are lightweight and do not substantially increase computational complexity compared to the baseline CenterTrack framework. Nevertheless, deployment on resource-constrained edge devices remains challenging due to the deep backbone. Future work will focus on model compression, backbone optimization, and lightweight association strategies to further improve deployment flexibility.(5)2D imaging constraints and depth ambiguity: The proposed framework is evaluated under a monocular 2D imaging setup, which inherently limits observability along the depth axis. Fish moving outside the camera field of view or overlapping in depth cannot be continuously tracked, and depth information from the stereo setup was not explicitly exploited in this study. These factors constrain long-term continuous monitoring and should be addressed in future work through multi-view or depth-aware extensions.

Future research will extend the dataset to environments with higher fish densities and more complex visual conditions to improve robustness under domain shifts. Incorporating re-identification features and long-term temporal modeling will be important for preserving identity continuity in extended observations. In addition, optimizing software and hardware configurations will be essential for scalable, real-time deployment in practical aquaculture systems.

4.4. Implications for Aquaculture and Behavioral Analysis

The proposed framework provides a practical tool for quantitative behavioral analysis in aquaculture. By combining accurate multi-fish tracking with reliable activity estimation, it enables:

(1)Non-invasive monitoring of fish vitality and adaptability to environmental changes.
(2)Real-time assessment of group behavior and schooling patterns.
(3)Data-driven support for feeding strategies, health management, and tank design.

Overall, this method offers a foundation for automated aquaculture monitoring and future studies on fish behavior, bridging the gap between advanced tracking technology and practical aquaculture applications.

5. Conclusions

This study introduced an enhanced CenterTrack-based framework for multi-fish tracking, which was designed to generate high-quality and continuous trajectories in dense underwater environments with frequent occlusions and non-rigid fish motion. The proposed framework effectively addressed challenges such as temporary target loss, appearance similarity, and scale variation, enabling stable identity preservation across frames.

To evaluate its performance, we constructed MF25, a real-world underwater multi-fish dataset containing 75 individually annotated grass carp recorded under varied conditions, including occlusions, varying illumination, and high population density. On MF25, the proposed approach achieved an IDF1 score of 82.5%, a MOTA of 85.8%, and an IDP of 84.7%, demonstrating robust identity consistency and accurate multi-fish tracking performance.

These results indicated that the framework reliably produced high-precision trajectories suitable for downstream behavioral analyses, including swimming pattern characterization, social interaction analysis, and responses to environmental changes. While performance under extreme low-light conditions and deployment on resource-constrained devices remained a limitation, the framework provided a practical tool for behavior-driven research and intelligent monitoring in industrial aquaculture systems. Future work would focus on domain adaptation, re-identification features, and model compression to broaden applicability.

Bibliography39

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Liu Y. Li B. Zhou X. Li D. Duan Q. Fish Track: Multi-object tracking method for fish using spatiotemporal information fusion Expert Syst. Appl.202423812219410.1016/j.eswa.2023.122194 · doi ↗
2Zhang H. Li W. Qi Y. Liu H. Li Z. Dynamic fry counting based on multi-object tracking and one-stage detection Comput. Electron. Agric.202320910787110.1016/j.compag.2023.107871 · doi ↗
3Zhao H. Cui H. Qu K. Zhu J. Li H. Cui Z. Wu Y. A fish appetite assessment method based on improved Byte Track and spatiotemporal graph convolutional network Biosyst. Eng.2024240465510.1016/j.biosystemseng.2024.02.011 · doi ↗
4Mei Y. Sun B. Li D. Yu H. Qin H. Liu H. Yan N. Chen Y. Recent advances of target tracking applications in aquaculture with emphasis on fish Comput. Electron. Agric.202220110733510.1016/j.compag.2022.107335 · doi ↗
5Liu Y. Li B. Si L. Liu C. Li D. Duan Q. Group activity amount estimation for fish using multi-object tracking Aquac. Eng.202511010253410.1016/j.aquaeng.2025.102534 · doi ↗
6Liu Z. Wang X. Wang C. Liu W. Bai X. Sparse Track: Multi-object tracking by performing scene decomposition based on pseudo-depth IEEE Trans. Circuits Syst. Video Technol.2025354870488210.1109/TCSVT.2024.3524670 · doi ↗
7Cui M. Liu X. Liu H. Zhao J. Li D. Wang W. Fish tracking, counting, and behaviour analysis in digital aquaculture: A comprehensive survey Rev. Aquac.2025171300110.1111/raq.13001 · doi ↗
8Xi C. Cui M. Yin J. Gu H. Ouyang T. Feng J. Zeng L. Enhanced deep OC-SORT with YOL Ov 8-seg for robust fry tracking and behavior analysis in aquaculture Aquaculture 202561074288710.1016/j.aquaculture.2025.742887 · doi ↗