DenseDuckMOT: A Real-Time Detection-Tracking Coupled Counting Framework for Complex Avicultural Environments
Jiaxing Xie, Jiatao Wu, Liye Chen, Yue Cao, Zihao Chen, Meiyi Lu, Yujian Lin, Chunxi Tu, Weixing Wang, Jinshui Lin

TL;DR
This paper introduces DenseDuckMOT, a real-time system for accurately counting ducks in crowded barns using lightweight detection and tracking methods.
Contribution
The novel framework combines an improved detector (DuckNet) and a robust tracker (AKFTrack) for efficient and accurate duck counting in complex environments.
Findings
DenseDuckMOT achieved 98.19% precision and 97.72% recall for duck detection and tracking.
The AKFTrack tracker outperformed existing methods in crowded and occluded scenes.
The system supports real-time monitoring with minimal hardware requirements.
Abstract
Counting ducks in crowded barns is challenging because individuals frequently overlap, move rapidly, and appear blurred in surveillance footage. We propose a lightweight visual perception and online tracking method that detects each Liancheng White Duck and maintains consistent identities across frames for stable, real-time counting. The detector was trained on 2416 annotated images, and the tracking performance was evaluated on five real surveillance videos from a breeding farm. The method achieved 98.19% precision and 97.72% recall with a compact model size of about 4.5 MB, supporting deployment on resource-limited devices. In densely crowded and heavily occluded scenes, the tracking component produced more continuous trajectories and fewer identity mix-ups than commonly used tracking approaches. This work reduces manual workload and unnecessary human disturbance, providing a…
- Fujian Provincial Individual Science & Technology Commissioner Project
- Fujian Provincial Team Science & Technology Commissioner Project
- College Student Innovation and Entrepreneurship Training Program, China
Taxonomy
Topics: Animal Behavior and Welfare Studies · Video Surveillance and Tracking Methods · Smart Agriculture and AI
1. Introduction
Livestock production remains a cornerstone of rural livelihoods [1], yet its productivity and sustainability increasingly depend on efficient monitoring and management technologies. Among China’s diverse poultry resources, the Liancheng White Duck is a nationally protected rare breed with a century-long breeding history and considerable conservation and utilization value [2]. However, current farming practices still rely heavily on manual counting, which is labor-intensive, error-prone, and liable to induce stress responses that compromise both animal welfare and farming efficiency [3]. Manual handling also heightens the risk of pathogen introduction into duck houses, posing additional threats to the farming environment [4]. To mitigate these issues, researchers have increasingly turned to computer vision and deep learning techniques for automated detection and tracking in poultry farming. While such methods reduce human intervention and improve counting accuracy, they still struggle in high-density duck barns, particularly under occlusion, motion blur, and uneven illumination. These challenges highlight the urgent need for a lightweight, high-precision, and real-time detection and tracking system tailored to the complexities of intensive poultry farming [5].
In recent years, computer vision and deep learning have been increasingly applied to livestock and poultry monitoring, with related studies falling broadly into two categories: behavior recognition, and object detection and counting. For example, Guo et al. [6] introduced MCA-YOLOv5 for cross-domain free-range broiler monitoring, capable of recognizing 12 behaviors; however, in videos, it may suffer tracking failure under fast motion and confusion between similar behaviors when temporal changes are not considered. Nasiri et al. [7] proposed a video-based broiler behavior monitoring system capable of detecting activities such as stretching and preening, but its robustness was limited under varying illumination and occlusion. In studies of duck flocks, Zhao et al. [8] proposed a multi-pose detection method based on HRNet (High-Resolution Network), which achieved good performance in most behavior recognition tasks but suffered from reduced accuracy in drinking and resting scenarios due to the occlusion of key body parts. Wang et al. [9] proposed YOLO-TransT, integrating an enhanced YOLOv8n detector with the TransT tracker for estrus-cow detection and tracking, but its performance can still be affected under noisy conditions such as extremely low light, backlight glare, and camera shake. Overall, deep learning demonstrates strong potential for real-time monitoring of livestock and poultry, but persistent challenges such as flock occlusion, scale variation, and blurred boundaries remain unresolved, particularly in high-density duck barns.
To facilitate deployment on mobile and embedded devices, researchers have increasingly focused on lightweight network architectures [10]. Representative works include MobileNet [11], which significantly reduces parameter counts using depthwise separable convolutions, and EfficientDet-Lite [12], which achieves efficient feature fusion through a scalable BiFPN (Bidirectional Feature Pyramid Network). In the poultry domain, Guo et al. [13] proposed a lightweight network for monitoring pigeon feather-cleaning behavior, which maintained high accuracy while reducing computational costs. Xiao et al. [14] presented a lightweight YOLOv8-based model for monitoring duck flock behaviors under bright and dark conditions, introducing a GhostNet backbone to accelerate inference speed with minimal accuracy loss. Chang et al. [15] developed an improved model for chick sex detection that achieved modest gains over YOLOv10n by incorporating a compact backbone (StarNet) and an improved detection head (GN Head). The most recent YOLOv11 has further advanced detection accuracy and efficiency, but its performance still declines under occlusion and blur in high-density duck barns. Therefore, improving the performance and efficiency of models in complex environments while reducing computational costs remains a significant challenge.
However, relying solely on the detection module remains insufficient to address tracking instability in high-density duck flocks, where individuals exhibit highly similar appearances, homogeneous feather textures, and frequent occlusion. Traditional multi-object tracking methods, such as the Separate Detection and Embedding (SDE)-based DeepSORT and StrongSORT, perform well in general scenarios but rely heavily on ReID (Re-identification) features, for which effective representations are difficult to extract in duck flock environments [16,17]. In contrast, Joint Detection and Embedding (JDE) approaches offer advantages in computational efficiency but similarly depend on appearance consistency, often overlooking low-confidence targets and resulting in trajectory interruptions [18]. Therefore, although these multi-object tracking methods have demonstrated powerful capabilities in various fields, they remain poorly suited to tracking high-density farmed duck flocks.
This study aims to address the challenge of achieving lightweight, high-precision, and robust detection and tracking under conditions of occlusion and motion blur in high-density duck barns. From a practical farming perspective, the goal of tracking in this work is not to track individuals for its own sake, but to enable reliable flock monitoring and counting in dense duck houses. In high-density barns, individuals frequently overlap, temporarily disappear behind others, and reappear, making frame-by-frame counting based on detections alone prone to double-counting or missed individuals. By maintaining identity continuity over time, tracking provides a more stable basis for monitoring under transient occlusions while further reducing the need for human entry into the duck house. Importantly, our framework is designed to operate with existing fixed surveillance cameras commonly available on farms, which keeps additional hardware investment low and facilitates practical adoption. To this end, we propose DenseDuckMOT, a coupled detection–tracking framework tailored for complex farming environments. The main contributions of this work are summarized as follows:
(1) A lightweight backbone network, DuckNet, was designed on top of the YOLOv11 framework. By integrating BiFPN, GLSA (Global-Local Spatial Aggregation) and ESDH (Efficient Shared Detection Head), the model improves detection accuracy while keeping computational complexity under control.
(2) An AKFTrack (Adaptive Kalman Filter Track) tracker was developed by combining adaptive Kalman filtering with a two-stage matching strategy, thereby enhancing stability and robustness under high-density occlusion scenarios.
(3) The proposed DenseDuckMOT framework was validated in a real-world, high-density duck barn environment, demonstrating real-time, high-precision detection and stable multi-target tracking.
2. Materials and Methods
2.1. Data Acquisition and Preprocessing
The data for this study were collected at the Liancheng White Duck Breeding Farm in Liancheng County, Longyan City, Fujian Province, using 120-day-old Liancheng White Ducks. The recordings were obtained from two production batches raised in the same duck house under comparable management. Data collection was conducted from 15 to 30 August 2024, and from 9 to 10 December 2024, covering both summer and winter seasons to enhance dataset diversity. Local meteorological records near the barn indicated that, during the selected sampling hours from 17:00 to 06:00, the ambient air temperature ranged from 22.0 to 29.2 °C in August and from 10.2 to 18.1 °C in December, while relative humidity ranged from 72.9 to 97.9% and from 56.5 to 89.4%, respectively. The monitored area consisted of 20 pens, each housing about 24 ducks, resulting in approximately 480 ducks present in the barn area, with an average bodyweight of around 1.2 kg per duck.
To prevent stress responses caused by manual intervention, data were obtained using Dahua surveillance cameras fixed inside the duck house, installed at a height of 2 m and positioned 1.5 m from the fence. The surveillance videos were recorded at a resolution of 1920 × 1080, with a framerate of 25 fps, and stored in MP4 format. To ensure that sufficient individuals were captured in the frames, video segments recorded daily from 17:00 to 06:00 the following day were selected as sampling material. This sampling window was chosen because the farm adopts a semi-free-range management system, under which ducks are predominantly housed indoors during the evening-to-morning period, whereas daytime periods often involve partial outdoor access and reduced indoor occupancy, making the indoor surveillance footage less representative of a stable flock. The data collection workflow is illustrated in Figure 1. First, surveillance videos were extracted from the local disk using the monitoring management platform. Second, one frame was sampled every 100 frames (one image every 4 s at 25 fps) using the OpenCV library to construct a single-frame image dataset for detector training. Finally, the sampled images were preprocessed. As wide-angle lenses may cause barrel distortion at the image edges, an anti-distortion algorithm was applied for correction. In addition, since some individuals could be occluded by fences, inverse perspective transformation was employed for geometric correction to reduce the impact of invalid regions. After screening, a total of 2416 images were obtained in JPG format.
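The frame-sampling step can be illustrated with a minimal sketch. It is not the authors' script: `sample_every` is a hypothetical helper that accepts any iterable of frames, so the logic can be checked without a video file. With OpenCV, the frames would typically come from repeated `cv2.VideoCapture.read()` calls.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def sample_every(frames: Iterable[T], step: int = 100) -> Iterator[Tuple[int, T]]:
    """Yield (frame_index, frame) for every `step`-th frame, starting at frame 0."""
    if step <= 0:
        raise ValueError("step must be positive")
    for index, frame in enumerate(frames):
        if index % step == 0:
            yield index, frame


FPS = 25    # recording framerate stated in the text
STEP = 100  # sampling interval stated in the text

# Time between two retained images, in seconds.
seconds_between_samples = STEP / FPS

# Stand-in for a decoded clip: one minute of video, with integers as "frames".
one_minute = range(FPS * 60)
sampled_indices = [index for index, _ in sample_every(one_minute, STEP)]
```

Under these settings one minute of footage contributes 15 images, which keeps consecutive samples far enough apart to avoid near-duplicate training images.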
Five representative surveillance video clips were selected for multi-object tracking evaluation and are referred to as Video 1 to Video 5 throughout this manuscript. All clips were recorded in the same duck house using the same fixed camera configuration, but they capture different time periods and activity levels, resulting in varying scene difficulty. Video 1 depicts a highly crowded flock with frequent overlap, where repeated occlusion and intermittent visibility can fragment trajectories. Video 2 shows a moderate-density scene with less persistent occlusion, providing a relatively easier case for association. Video 3 involves evident pose variation and local illumination fluctuation, which may reduce detection confidence and cause short-term association ambiguity. Video 4 is occlusion-heavy with frequent close-contact interactions, substantially increasing the risk of identity switches when multiple ducks move in proximity. Video 5 features a crowded scene with more dynamic motion, where rapid movement and overlap can trigger missed detections and disrupt trajectory continuity.
Due to the high stocking density, frequent individual movements, and complex lighting conditions, the raw images generally exhibited the issues shown in Figure 2.
Based on the findings of Jiang et al. [19], which demonstrated that full-body annotation of ducks outperforms head-only annotation, this study adopted full-body bounding box annotation. The annotation was independently carried out by annotators using the LabelImg tool, and discrepancies were resolved through cross-validation, producing .txt files containing class information and bounding box coordinates. The dataset was randomly split into training, validation, and test sets at an 8:1:1 ratio.
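A random 8:1:1 split can be sketched as follows. The file names, the seed, and the rounding convention (floor division, with the remainder assigned to the test set) are assumptions made for illustration; the paper does not state how the split sizes were rounded.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle `items` and split them into train/val/test by integer ratios."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, reproducible for a given seed
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 2416 screened images.
image_names = [f"img_{i:04d}.jpg" for i in range(2416)]
train, val, test = split_dataset(image_names)
```

With this convention the three subsets are disjoint and together cover all 2416 images.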
The ground-truth for multi-object tracking in this study was annotated using the Track mode of the CVAT (Computer Vision Annotation Tool) platform [20]. For each evaluation video, a unique trajectory ID was assigned to every duck. The Track function was used to propagate bounding boxes across frames along the temporal axis, and frame-by-frame manual corrections were applied to compensate for box shifts caused by occlusion, motion blur, or pose changes. This procedure provides a reliable annotation basis for the subsequent computation of HOTA, MOTA, IDF1 and other multi-object tracking metrics.
In highly clustered scenes with partial occlusion, the number of individuals was not inferred from isolated body parts. Instead, each duck was annotated as an instance-level bounding box using the most reliable visible cues, and ambiguous cases were resolved by consulting adjacent frames in the original video to maintain temporal consistency.
2.2. DuckNet
The YOLO family has become one of the most representative one-stage detection frameworks for tasks such as image classification, object detection, and instance segmentation. Among them, YOLOv11 [21] has demonstrated higher accuracy and faster inference speed on the COCO dataset [22], making it an important baseline for real-time poultry monitoring. In this study, we propose an improved model, DuckNet, whose overall architecture follows the paradigm of input–backbone–neck–detection head, as illustrated in Figure 3.
In the backbone, DuckNet integrates the C3K2, SPPF, and C2PSA modules, as illustrated in Figure 3. The C3K2 module is composed of 1 × 1 and 3 × 3 convolutions with residual connections, supporting variable convolution kernels to expand the receptive field and improve computational efficiency [23]. The SPPF module employs parameter-shared pooling operations to capture receptive fields of sizes 1, 5, 9, and 13, thereby enhancing the model’s robustness to scale variations [24]. The C2PSA module combines channel splitting with an attention mechanism to process original and refined features in parallel, thereby enabling multi-granularity feature fusion.
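The receptive-field sizes quoted for SPPF follow from stacking the same 5-wide max pooling three times. The sketch below checks this on a one-dimensional signal; it is a simplification (the real module pools 2-D feature maps), and the helper `max_pool_1d` and the sample values are illustrative only.

```python
def max_pool_1d(values, kernel):
    """Stride-1 max pooling with unchanged output length (window clipped at the borders)."""
    radius = kernel // 2
    n = len(values)
    return [max(values[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4]

p5 = max_pool_1d(signal, 5)   # one pool: 5-wide window
p9 = max_pool_1d(p5, 5)       # two stacked pools: equivalent to a 9-wide window
p13 = max_pool_1d(p9, 5)      # three stacked pools: equivalent to a 13-wide window

# Stacked small pools reproduce the larger windows exactly.
same_as_9 = p9 == max_pool_1d(signal, 9)
same_as_13 = p13 == max_pool_1d(signal, 13)
```

Together with the unpooled input (window size 1), the three stacked outputs cover the four scales 1, 5, 9, and 13 while reusing a single small pooling operation.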
Building on this foundation, three further improvements are introduced: embedding the GLSA module into the backbone to balance long-range semantic representation with local detail expression; adopting BiFPN in the neck, which employs learnable path weights to enable adaptive cross-scale feature fusion; and designing an ESDH in the detection head, where convolution kernel parameters are shared across scales while scale-specific normalization layers are retained. Through these improvements, DuckNet effectively reduces computational redundancy while significantly enhancing detection accuracy under occlusion and scale variation. The design details of each module will be further elaborated in the following subsections.
2.2.1. Global-Local Spatial Aggregation
The baseline YOLOv11 model tends to encounter blurred individual boundaries, clustered group features, and semantic confusion when handling occlusion scenarios in high-density duck flocks. To enhance the model’s discriminative capability, we designed a GLSA module (as shown in Figure 4), which achieves the simultaneous enhancement of long-range semantic consistency and local boundary details through the collaborative optimization of global and local branches.
In the global branch, the input feature is first processed by a 1 × 1 convolution to reduce the dimensionality to d channels. Subsequently, a Softmax attention mechanism is applied along the spatial dimension to perform global contextual modeling, and an MLP (Multi-Layer Perceptron) is employed to further enhance semantic representation. The final output is injected back into the original feature via a residual connection using channel-wise multiplication by default, with channel-wise addition as an optional choice, as shown in Equations (1)–(3),
where ⊗ represents matrix multiplication. The MLP incorporates non-linear transformations with LayerNorm, and this design inherits the long-range dependency modeling capability of the Non-local network [25].
In the local branch, the input feature is first processed through a 1 × 1 convolution and a three-layer Depthwise Convolution (DWConv) residual structure to extract fine-grained features. Subsequently, a CBAM-style spatial attention mechanism is employed, where channel-wise average pooling and max pooling are concatenated, followed by convolution fusion and a Sigmoid operation to generate the spatial attention map. This attention map is then element-wise multiplied with the feature maps, as shown in Equations (4)–(6).
Here, the attention map is applied to the feature maps by element-wise multiplication, and each unit of the three-layer structure is a Depthwise Convolution residual unit.
Finally, the global feature and the local feature are concatenated along the channel dimension and compressed to d channels using a 1 × 1 convolution, as expressed in Equation (7):
Therefore, the GLSA module ensures semantic consistency at the global level while enhancing motion boundaries and spatial details at the local level, thereby effectively mitigating occlusion and blur issues in high-density scenarios.
2.2.2. Bi-Directional Feature Pyramid Network
To further integrate the GLSA-enhanced features across multiple scales while balancing semantic and detailed information, this study adopts a weighted bidirectional feature pyramid network (BiFPN) in the neck (as shown in Figure 5). Unlike the traditional FPN, which propagates features only in a top-down manner, BiFPN performs bidirectional fusion through both top-down and bottom-up pathways. In addition, learnable weights are introduced at each fusion node to adaptively allocate the contributions of features at different scales [26].
To avoid scale bias caused by naive summation, BiFPN introduces non-negative learnable weights for the input features at each fusion node and adopts a robust normalization strategy for weighted summation:
$$\hat{w}_i = \frac{w_i}{\epsilon + \sum_j w_j}, \qquad O = \sum_i \hat{w}_i \, I_i$$
where $w_i$ is the initial learnable weight, $\epsilon$ is a small constant (to prevent the denominator from being zero), $\hat{w}_i$ is the normalized weight of the corresponding feature map, $I_i$ is the feature map of the i-th input, and $O$ is the fused output.
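A minimal sketch of normalized weighted fusion at a single node is given below. Feature maps are reduced to short lists for readability, the ReLU clamp is one common way of keeping weights non-negative, and the default `eps` value is an assumption rather than a setting reported in this paper.

```python
def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse same-shaped feature vectors using non-negative, normalized weights."""
    weights = [max(0.0, w) for w in raw_weights]  # clamp keeps weights non-negative
    denom = eps + sum(weights)                     # eps prevents division by zero
    norm = [w / denom for w in weights]
    length = len(features[0])
    fused = [sum(n * f[k] for n, f in zip(norm, features)) for k in range(length)]
    return fused, norm


# Two inputs at one fusion node; the second is weighted three times as strongly.
fused, norm = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 3.0])
```

Because the weights are normalized at every node, raising the weight of a shallow input necessarily lowers the relative contribution of the deeper one, which is how the relative importance of detail features can shift during training.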
In dense duck house scenarios, shallow features (P2–P3) contain abundant feather textures and individual edge information, making them more sensitive to occlusion, adhesion, and motion blur, whereas deep features (P4–P5) provide stable global semantics but lack fine-grained details. To address this, we explicitly introduce P2 detail backflow into the BiFPN and employ learnable weighted fusion to dynamically increase the relative importance of shallow details during training while preserving the semantic consistency from deeper layers. Complementary to the “global–local enhancement within a single scale” achieved by GLSA, the BiFPN enables joint optimization of “cross-scale semantic consistency and detail fidelity,” thereby significantly improving detection robustness and counting accuracy under high-density occlusion conditions.
2.2.3. Efficient Shared Detection Head
To balance lightweight design and detection accuracy in high-density duck house environments, we propose an Efficient Shared Detection Head (ESDH), whose structure is illustrated in Figure 6. This module takes three-scale features from the BiFPN as input, first performing channel alignment using a 1 × 1 convolution combined with Group Normalization to ensure dimensional consistency across different scales. Subsequently, all scale features share two layers of detail-enhanced convolution modules with Group Normalization (DEConv_GN), which effectively reduce parameter redundancy while enhancing fine-grained representations such as edges and textures [27]. Unlike conventional YOLO detection heads that stack convolutional layers independently at each scale, this design achieves cross-scale parameter sharing through a unified structure, enabling the detection head to remain lightweight while exhibiting stronger feature recovery capability.
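At the output stage, the ESDH is divided into a regression branch and a classification branch. The regression branch employs Distribution Focal Loss (DFL), which models boundary offsets as discrete distributions. With reg_max = 16, the predicted distribution for each boundary e∈{l,t,r,b} is formulated as Equation (11):
$$P_e(i) = \frac{\exp(z_{e,i})}{\sum_{j=0}^{\mathrm{reg\_max}-1} \exp(z_{e,j})}, \qquad i = 0, 1, \ldots, \mathrm{reg\_max}-1 \tag{11}$$
where $z_{e,i}$ is the predicted logit of the i-th bin for boundary e.
The expectation of this distribution serves as the continuous boundary distance estimate, as shown in Equation (12):
$$\hat{d}_e = \sum_{i=0}^{\mathrm{reg\_max}-1} i \, P_e(i) \tag{12}$$
Finally, by combining the anchor position with the current layer stride s, the decoded distances are transformed into the final bounding box prediction through the decoding function, as defined in Equation (13):
$$(x_1, y_1, x_2, y_2) = \left(x_a - s\,\hat{d}_l,\; y_a - s\,\hat{d}_t,\; x_a + s\,\hat{d}_r,\; y_a + s\,\hat{d}_b\right) \tag{13}$$
where $(x_a, y_a)$ is the anchor position in image coordinates.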
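The regression decoding described above can be sketched in a few lines. The sketch assumes the anchor is given in pixels and the expected distances are expressed in units of the stride; the function names and the dictionary layout of the logits are illustrative, not the authors' implementation.

```python
import math

REG_MAX = 16  # number of discrete bins per box side, as stated in the text


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_distance(logits):
    """Expectation of the bin index under the softmax distribution."""
    return sum(i * p for i, p in enumerate(softmax(logits)))


def decode_box(anchor_xy, side_logits, stride):
    """Convert per-side logits (keys 'l', 't', 'r', 'b') into an (x1, y1, x2, y2) box in pixels."""
    ax, ay = anchor_xy
    left, top, right, bottom = (expected_distance(side_logits[k]) * stride for k in "ltrb")
    return (ax - left, ay - top, ax + right, ay + bottom)


# Uniform logits spread the probability evenly over bins 0..15, so the expectation is 7.5.
uniform = {k: [0.0] * REG_MAX for k in "ltrb"}
box = decode_box((100.0, 80.0), uniform, stride=8)
```

Taking the expectation rather than the most likely bin yields a continuous distance, so an uncertain (flat) distribution produces an intermediate estimate instead of an arbitrary bin boundary.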
The classification branch predicts class probabilities through a 1 × 1 convolution, followed by a Sigmoid activation to generate the final classification outputs. To further enhance the consistency of predictions across different scales, both the regression and classification branches incorporate learnable scale factors (Scale), which dynamically adjust the prediction magnitude at each layer and thereby improve cross-scale accuracy balance.
Compared with conventional YOLO detection heads, the ESDH significantly reduces the parameter count through shared DEConv_GN modules, while integrating DFL and Scale mechanisms to refine bounding box regression and strengthen multi-scale detection consistency. This design demonstrates enhanced robustness and stability in dense duck-house environments characterized by severe occlusion and edge blurring, establishing ESDH as a critical component of the DenseDuckMOT framework for high-precision individual detection.
2.3. Adaptive Kalman Filter Track
Despite the high-precision object detection achieved by the DuckNet model, challenges remain in maintaining trajectory continuity and robustness, particularly under conditions of frequent occlusion, overlap, and low-confidence detections in dense duck house environments. To address these challenges, this study proposes AKFTrack, which takes the detection results of DuckNet as input and employs an Adaptive Kalman Filter (AKF) in combination with a two-stage matching strategy to achieve stable multi-object tracking. The synergy between DuckNet and AKFTrack constitutes the core of the DenseDuckMOT framework: DuckNet provides reliable detections, while AKFTrack guarantees the temporal continuity of trajectories. The overall workflow is illustrated in Figure 7. Notably, all tracking experiments in this study were conducted on the original continuous surveillance videos (25 fps), rather than on temporally downsampled frames. This is because temporal downsampling would increase inter-frame displacement and the probability that occlusions or crossings occur between frames, thereby aggravating data association ambiguity. Consequently, more detections would remain unmatched, leading to trajectory fragmentation or premature termination and thus increased tracking losses.
2.3.1. Adaptive Kalman Filter
In the state prediction stage, AKFTrack employs a Kalman filter to recursively estimate the target’s position and velocity. Unlike conventional methods, an adaptive noise adjustment mechanism is introduced, allowing the prediction process to automatically correct uncertainty based on variations in target motion. The prediction equations are defined as (14)–(15):
$$\hat{x}_{k|k-1} = F \, \hat{x}_{k-1} \tag{14}$$
$$P_{k|k-1} = F \, P_{k-1} \, F^{\top} + Q_k \tag{15}$$
where F denotes the state transition matrix, $\hat{x}_{k|k-1}$ and $\hat{x}_{k-1}$ are the predicted state and the state estimate at time step k − 1, and $P_{k|k-1}$ and $P_{k-1}$ represent the predicted covariance and the covariance at time step k − 1, respectively.
The process noise is modeled as a combination of a fixed component and an adaptive component, as defined in Equation (16):
$$Q_k = Q_{\mathrm{fixed}} + Q_{\mathrm{adapt},k} \tag{16}$$
Here, $Q_{\mathrm{adapt},k}$ is dynamically adjusted according to variations in velocity and acceleration, thereby enhancing the stability of the filter under rapid motion and partial occlusion.
The state update process is defined as Equation (17):
$$\hat{x}_k = \hat{x}_{k|k-1} + K_k \left(z_k - H \, \hat{x}_{k|k-1}\right) \tag{17}$$
Here, $K_k$ denotes the Kalman gain; $z_k$ represents the actual observation at time step k; and H is the observation matrix.
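The predict and update cycle with an adaptive process noise can be illustrated on a single coordinate. The sketch below is a constant-velocity filter in which the adaptive term grows with the squared change in estimated velocity; this particular rule, the class name, and all parameter values are assumptions for illustration, since the paper's exact adaptation rule is not reproduced here.

```python
class AdaptiveKalman1D:
    """Constant-velocity Kalman filter on one coordinate with motion-dependent process noise."""

    def __init__(self, position, q_fixed=1e-2, q_gain=1.0, r=1.0, dt=1.0):
        self.x = [position, 0.0]            # state: [position, velocity]
        self.P = [[1.0, 0.0], [0.0, 1.0]]   # state covariance
        self.q_fixed, self.q_gain, self.r, self.dt = q_fixed, q_gain, r, dt
        self.prev_velocity = 0.0

    def process_noise(self):
        # Assumed adaptive rule: fixed part plus a term scaled by squared acceleration.
        accel = (self.x[1] - self.prev_velocity) / self.dt
        return self.q_fixed + self.q_gain * accel * accel

    def predict(self):
        dt, q = self.dt, self.process_noise()
        p, v = self.x
        self.x = [p + dt * v, v]            # x = F x, with F = [[1, dt], [0, 1]]
        (a, b), (c, d) = self.P
        # P = F P F^T + Q, with Q = q * I
        self.P = [[a + dt * (b + c) + dt * dt * d + q, b + dt * d],
                  [c + dt * d, d + q]]
        return self.x[0]

    def update(self, z):
        self.prev_velocity = self.x[1]
        (a, b), (c, d) = self.P
        s = a + self.r                      # innovation covariance, H = [1, 0]
        k0, k1 = a / s, c / s               # Kalman gain
        y = z - self.x[0]                   # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        self.P = [[(1 - k0) * a, (1 - k0) * b],   # P = (I - K H) P
                  [c - k1 * a, d - k1 * b]]


# A target moving 2 pixels per frame, observed without noise for 30 frames.
kf = AdaptiveKalman1D(position=0.0)
for k in range(1, 31):
    kf.predict()
    kf.update(2.0 * k)
```

While the velocity estimate is still changing, the inflated process noise makes the filter trust new observations more; once the motion settles, the noise falls back to the fixed component and the trajectory is smoothed.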
2.3.2. Two-Stage Matching Strategy
In the trajectory association stage, AKFTrack employs a two-stage matching mechanism. High-confidence detection boxes are first matched with predicted trajectories to ensure stable tracking of primary targets, while low-confidence and partially occluded detections are subsequently incorporated through supplementary matching to prevent trajectory loss. The matching cost function is defined in Equation (18).
It combines the spatial overlap between a trajectory and a detection box with the detection confidence, and a balancing factor controls the relative weight of the two terms.
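The two-stage association can be sketched as two passes of the same matcher. The cost form used here, 1 − [λ·IoU + (1 − λ)·confidence], is an assumed stand-in for the paper's matching cost, and greedy assignment replaces whatever assignment solver the authors used; the thresholds are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, lam, max_cost):
    """Pair track ids with detection indices in order of increasing cost."""
    candidates = []
    for tid, tbox in tracks.items():
        for didx, (dbox, conf) in detections.items():
            overlap = iou(tbox, dbox)
            if overlap <= 0.0:
                continue  # boxes that do not overlap are never associated
            cost = 1.0 - (lam * overlap + (1.0 - lam) * conf)  # assumed cost form
            if cost <= max_cost:
                candidates.append((cost, tid, didx))
    matches, used_tracks, used_dets = {}, set(), set()
    for _, tid, didx in sorted(candidates):
        if tid not in used_tracks and didx not in used_dets:
            matches[tid] = didx
            used_tracks.add(tid)
            used_dets.add(didx)
    return matches


def two_stage_match(tracks, detections, high_thresh=0.5, lam=0.8, max_cost=0.7):
    """tracks: {id: predicted box}; detections: list of (box, confidence)."""
    indexed = dict(enumerate(detections))
    high = {i: d for i, d in indexed.items() if d[1] >= high_thresh}
    low = {i: d for i, d in indexed.items() if d[1] < high_thresh}
    first = greedy_match(tracks, high, lam, max_cost)            # stage 1: confident boxes
    remaining = {t: b for t, b in tracks.items() if t not in first}
    second = greedy_match(remaining, low, lam, max_cost)         # stage 2: leftover tracks
    return {**first, **second}


tracks = {1: (0, 0, 10, 10), 2: (20, 0, 30, 10)}
detections = [((1, 0, 11, 10), 0.9), ((21, 1, 31, 11), 0.3)]
matches = two_stage_match(tracks, detections)
```

In this example the second duck is detected with a confidence of only 0.3, as can happen under partial occlusion. A single-stage matcher with a 0.5 threshold would discard that box and leave track 2 unmatched, whereas the second stage recovers it.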
While DuckNet guarantees detection accuracy, AKFTrack enhances temporal information modeling through adaptive Kalman filtering and two-stage matching, thereby reducing the risk of ID switches and trajectory loss. Together, these modules enable DenseDuckMOT to achieve stable and accurate white duck detection and counting in complex, high-density farming environments.
3. Results
For model training and testing, a high-performance computer workstation was used. This workstation-based setup was adopted to provide a reproducible and controlled benchmarking environment for fair comparison across methods. Windows 10 was used in our laboratory pipeline due to stable driver support and compatibility with the data acquisition and annotation workflow. The proposed framework is platform-agnostic and can be deployed on Linux-based systems as well. Detailed information on the GPU and configuration is provided in Table 1.
3.1. Evaluation Metrics
To compare the DuckNet proposed in this paper with other models, we primarily used mean Average Precision (mAP), a standard metric of object detection performance. Its calculation formula is given in Equation (19).
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P_i(R) \, dR \tag{19}$$
Here, N is the number of object classes, P represents the precision of the model, and R represents the recall of the model, as defined in Equations (20) and (21), respectively. TP represents the number of correctly detected positive samples, FP represents the number of falsely detected targets, and FN represents the number of missed targets.
$$P = \frac{TP}{TP + FP} \tag{20}$$
$$R = \frac{TP}{TP + FN} \tag{21}$$
Additionally, we report the model’s F1 score (Equation (22)), which balances false detections against missed detections, together with the parameter count and the weight file size, which indicate how lightweight the model is.
$$F1 = \frac{2 \, P \, R}{P + R} \tag{22}$$
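As a worked illustration of precision, recall, and F1, the sketch below computes them from detection counts. The counts in the example are hypothetical and serve only to show the arithmetic.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from_pr(precision, recall)
    return precision, recall, f1


def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts: 90 correct boxes, 10 false detections, 30 missed ducks.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)

# Harmonic mean of the precision and recall reported in the abstract.
f1_reported = f1_from_pr(0.9819, 0.9772)
```

Because F1 is a harmonic mean, it stays close to the lower of the two values; for the reported precision of 98.19% and recall of 97.72% it evaluates to about 97.95%.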
We evaluate multi-object tracking performance using HOTA, MOTA, IDR, IDF1 and IDSW, as defined in formulas (23)–(27). MOTA reflects overall tracking accuracy by jointly accounting for false positives, missed detections, and identity switches. HOTA captures detection, association, and localization quality in a single metric, providing a more balanced and comprehensive assessment of tracking performance. IDF1 is the F1-score computed over matched identities and reflects the ability of the tracker to maintain consistent IDs over time. IDSW denotes the number of identity switches, the number of times the predicted ID for a given ground-truth trajectory changes between consecutive frames; ideally, each target should retain a constant ID, so a lower IDSW indicates better tracking stability.
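$$\mathrm{HOTA}_{\alpha} = \sqrt{\frac{\sum_{c \in TP} \mathcal{A}(c)}{|TP| + |FN| + |FP|}}, \qquad \mathcal{A}(c) = \frac{|TPA(c)|}{|TPA(c)| + |FNA(c)| + |FPA(c)|}$$
where $\mathcal{A}(c)$ denotes the association accuracy of a matched detection c at localization threshold α; TPA is the number of true positive associations; FNA is the number of false negative associations, i.e., ground-truth trajectories that remain unassociated; FPA is the number of false positive associations, i.e., predictions that are incorrectly associated with a ground-truth trajectory; and FP, FN, and TP denote the numbers of false positive, false negative, and true positive detections, respectively.
$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(FN_t + FP_t + IDs_t\right)}{\sum_t GT_t}$$
where t indexes the video frame, $GT_t$ denotes the number of ground-truth targets in frame t, and $IDs_t$ denotes the number of identity switches occurring in that frame.
$$\mathrm{IDR} = \frac{IDTP}{IDTP + IDFN}, \qquad \mathrm{IDF1} = \frac{2 \, IDTP}{2 \, IDTP + IDFP + IDFN}$$
where IDTP denotes the number of identity true positives; IDFP denotes the number of identity false positives; and IDFN denotes the number of identity false negatives.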
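MOTA and the identity-based scores can be computed directly from per-frame and per-identity counts. The sketch below uses their standard definitions; the function names and the example counts are hypothetical and are not results from this paper.

```python
def mota(per_frame):
    """MOTA from a list of (fn, fp, idsw, gt) tuples, one tuple per frame."""
    errors = sum(fn + fp + idsw for fn, fp, idsw, _ in per_frame)
    gt_total = sum(gt for *_, gt in per_frame)
    return 1.0 - errors / gt_total


def id_scores(idtp, idfp, idfn):
    """Identity precision, identity recall and IDF1 from identity-level counts."""
    idp = idtp / (idtp + idfp)
    idr = idtp / (idtp + idfn)
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn)
    return idp, idr, idf1


# Three frames with 10 ground-truth ducks each: one miss, one false box, one ID switch.
frames = [(1, 0, 0, 10), (0, 1, 1, 10), (0, 0, 0, 10)]
mota_value = mota(frames)

# Hypothetical identity-level counts.
idp, idr, idf1 = id_scores(idtp=80, idfp=10, idfn=20)
```

An identity switch costs MOTA exactly as much as a single missed box, which is why IDF1 and IDSW are reported alongside it when identity continuity is what matters for counting.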
3.2. Comparative Experiments on DuckNet
To assess the effectiveness of neck network improvements in the DuckNet model, this experiment compared several existing variants, namely MAFPN [28], Slimneck [29], and AFPN [30]. All models were evaluated under identical datasets and training conditions, with the results presented in Figure 8.
YOLOv11-BiFPN-GLSA achieved the best overall performance, with 97.22% Precision, 97.00% Recall, and 92.58% mAP@0.5, while requiring only 2.07 M parameters, a 4460 KB model size, and 6.7 G FLOPs. YOLOv11-AFPN obtained the highest F1-score, whereas YOLOv11-Slimneck reached the fastest FPS, but showed weaker accuracy. Overall, BiFPN-GLSA provided the most favorable trade-off between precision and efficiency, confirming its practicality for real-world farming applications.
Figure 9 shows that in high-density farming environments with frequent occlusion, YOLOv11-Slimneck and YOLOv11-AFPN often produce false detections and localization errors, while YOLOv11-MAFPN shows partial improvement but remains unstable. In contrast, YOLOv11-BiFPN-GLSA consistently delivers accurate bounding boxes under occlusion, reducing false positives and demonstrating superior localization stability.
To further evaluate the impact of detection head structures on model performance, comparative experiments were conducted by integrating YOLOv11 with SEAMHead [31], LADH-Head [32], and ESDH (as listed in Table 2). YOLOv11-ESDH achieved the best overall results, with 97.92% Precision, 96.38% Recall, 97.15% F1-score, and 92.85% mAP@0.5, while requiring only 2.26 M parameters and a 5070 KB weight file. It also reached 438.4 frames per second, about 7.3% faster than LADH-Head, demonstrating both efficiency and real-time capability. Overall, ESDH offers a superior balance of accuracy, complexity, and speed, highlighting its potential for real-time detection in dense farming environments.
Building on the detection head comparison, we evaluated model performance under motion blur (as shown in Figure 10). SEAMHead and LADH-Head showed frequent false and missed detections with poor localization, whereas ESDH significantly reduced errors and maintained higher accuracy and stability. These results confirm the superior robustness of ESDH in complex dynamic farming environments.
To evaluate the impact of different architectures on object detection performance, this study compared DuckNet with typical detection frameworks, including Faster R-CNN [33], SSD [34], and RT-DETR [35], as shown in Table 3. DuckNet achieved the best overall performance while remaining computationally efficient. It uses only 1.90 M parameters, with a compact model size of 4485 KB and 6.6 G FLOPs, while attaining 98.19% precision and 94.79% mAP@0.5. In contrast, Faster R-CNN requires 28.28 M parameters, 275.87 G FLOPs, and a much larger model size of 110,773 KB. SSD needs slightly less computation, with 5.67 G FLOPs and 3.68 M parameters, but its accuracy is notably lower, achieving 84.96% mAP@0.5. RT-DETR achieves comparable accuracy, but with substantially higher computation, using 19.87 M parameters, 56.9 G FLOPs, and a model size of 39,532 KB, which may limit real-time deployment on resource-constrained on-farm devices.
To further validate the effectiveness of DuckNet, we compared it with several mainstream YOLO algorithms, including YOLOv5 [36], YOLOv6 [37], YOLOv8 [38], YOLOv10 [39], and YOLOv11, as shown in Table 4.
Table 4 further confirms the lightweight property of DuckNet from both model complexity and runtime efficiency. DuckNet achieves the best accuracy, reaching 98.19% precision and 94.79% mAP@0.5 while keeping the parameter count at only 1.90 M and the computation at 6.6 G FLOPs. Under the same test settings, DuckNet runs at 303.4 frames per second, demonstrating that the improved accuracy is obtained without excessive computational overhead. Although some YOLO baselines exhibit higher framerates, they require larger parameter budgets or provide lower detection accuracy, indicating that DuckNet offers a more balanced trade-off between accuracy and efficiency for real-time farm surveillance scenarios.
Figure 11 shows that YOLOv5 and YOLOv8 frequently produced false detections, duplicate boxes, and unstable localization, especially under occlusion. YOLOv10 and YOLOv11 improved stability, but still missed targets. DuckNet maintained accurate and consistent predictions with minimal false or missed detections, demonstrating superior robustness and reliability under motion blur and occlusion.
3.3. Ablation Experiments of DuckNet
To examine the contribution of the neck components in DuckNet, we conducted an ablation study by enabling BiFPN and GLSA separately and jointly under identical training and evaluation settings. The quantitative results are summarized in Table 5. With both components removed, the model reached an mAP@0.5 of 93.13% with 2.58 M parameters. Introducing BiFPN alone improved mAP@0.5 to 93.54%, while GLSA alone resulted in 93.58%, indicating that each module provides consistent gains. When BiFPN and GLSA were combined, the performance reached the best overall level, and the parameter budget remained close to the lightweight regime, confirming that the two modules contribute complementary benefits while preserving model efficiency.
To further validate the effects of different module combinations, we conducted ablation experiments on the BiFPN, GLSA, and ESDH modules (as shown in Table 6). Either ESDH alone or the BiFPN-GLSA combination improved accuracy, but integrating all three achieved the best results. This synergy enhanced detection accuracy while preserving lightweight design, confirming the effectiveness of the combined architecture.
3.4. LayerCAM Visualization
To further evaluate the feature extraction capabilities of different models, this study employed LayerCAM [40] to visualize the regions of interest during the detection process (as shown in Figure 12). LayerCAM provides more fine-grained feature activations, thereby more accurately reflecting the attention regions of the models.
The original YOLOv11 showed scattered activations with background interference. Adding BiFPN or GLSA alone improved focus, but the activations remained partly noisy. Their combination further concentrated attention, and ESDH enhanced contour localization. DuckNet achieved the most compact and accurate activations, focusing almost entirely on targets and suppressing background noise, confirming superior discriminability and robustness.
3.5. AKFTrack Comparison Experiment
Under the same detector and evaluation settings, AKFTrack, ByteTrack, DeepSORT, and StrongSORT were compared on five surveillance videos of white ducks, as shown in Figure 13. The aggregated results over six metrics, namely MOTA, IDF1, Recall, HOTA, IDR and IDSW, indicate that AKFTrack achieves overall superior and more stable tracking performance than ByteTrack, DeepSORT and StrongSORT. In terms of MOTA, IDF1 and Recall, AKFTrack ranks at the best or second-best level across all sequences. The advantage is particularly pronounced in video 1, video 4, and video 5, where targets are highly crowded and severely occluded: AKFTrack maintains higher MOTA and IDF1 than the other trackers, while Recall remains in a high range, indicating that it effectively suppresses false positives and missed detections while preserving detection recall.
From the radar plots of HOTA and IDR, the HOTA envelopes of AKFTrack across multiple sequences are slightly larger than those of the other methods, and IDR remains consistently high and aligned with Recall, suggesting that the algorithm better preserves trajectory consistency and completeness over the global temporal scale. Meanwhile, AKFTrack yields the lowest overall IDSW, especially on long-duration sequences such as video 4 and video 5, where the number of identity switches is markedly lower than for DeepSORT and StrongSORT, reflecting stronger robustness in maintaining target identities under frequent interactions and occlusions. Taken together, these multimetric comparisons suggest that the incorporation of adaptive Kalman prediction and association strategies enables AKFTrack to achieve higher multi-object tracking accuracy and more stable trajectory continuity in complex waterfowl farming scenarios.
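For readers less familiar with the count-based metrics discussed above, the sketch below computes MOTA, IDF1, IDR, and Recall from per-sequence error counts using their standard definitions. It assumes the paper follows these conventional formulas. The class `SequenceCounts` and all numeric values are hypothetical and are not taken from the paper's videos. HOTA is omitted because it requires per-threshold association scores rather than simple counts.

```python
from dataclasses import dataclass


@dataclass
class SequenceCounts:
    """Hypothetical per-sequence tallies needed by the count-based metrics."""
    gt: int    # total ground-truth boxes over all frames
    fn: int    # ground-truth boxes with no matched detection
    fp: int    # detections with no matched ground-truth box
    idsw: int  # identity switches
    idtp: int  # identity true positives (after global ID matching)
    idfp: int  # identity false positives
    idfn: int  # identity false negatives


def mota(c: SequenceCounts) -> float:
    # MOTA = 1 - (FN + FP + IDSW) / GT
    return 1.0 - (c.fn + c.fp + c.idsw) / c.gt


def idf1(c: SequenceCounts) -> float:
    # IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)
    return 2 * c.idtp / (2 * c.idtp + c.idfp + c.idfn)


def idr(c: SequenceCounts) -> float:
    # IDR = IDTP / (IDTP + IDFN)
    return c.idtp / (c.idtp + c.idfn)


def recall(c: SequenceCounts) -> float:
    # Recall = TP / GT, where TP = GT - FN
    return (c.gt - c.fn) / c.gt


# Made-up counts for illustration only.
example = SequenceCounts(gt=1000, fn=30, fp=20, idsw=5, idtp=950, idfp=40, idfn=50)
print(f"MOTA={mota(example):.3f} IDF1={idf1(example):.3f} "
      f"IDR={idr(example):.3f} Recall={recall(example):.3f}")
```

In these definitions IDR can never exceed Recall, since a detection must first match a ground-truth box before it can be credited to the right identity. A small gap between the two, as in the "aligned" behavior described above, therefore means few matched detections carry the wrong identity.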
Additional qualitative results are shown in Figure 14. Under occlusion and rapid target motion, AKFTrack is able to maintain trajectory continuity and avoids the tracking losses or identity switches that are commonly observed in other algorithms. These results indicate that, in complex farming environments, AKFTrack offers higher robustness and greater practical utility than existing methods.
4. Discussion
High-density duck house monitoring remains challenging because individuals frequently overlap, move rapidly, and exhibit a highly homogeneous appearance, while illumination is uneven and motion blur is common. Prior poultry vision studies have repeatedly shown that these factors reduce robustness in practical deployment, especially under fast motion, occlusion, partial visibility, and lighting variation [6,7,8,9]. Our results further confirm that reliable barn monitoring requires an end-to-end perspective. Improving single frame detection alone is not sufficient because missed and low confidence detections immediately propagate into tracking as trajectory breaks and unstable counting. DenseDuckMOT therefore targets continuity as a primary objective, where robust detections provide stable inputs and the tracker explicitly preserves identity and temporal consistency under occlusion and uncertain observations.
From a detection perspective, the cross-architecture and cross-version comparisons indicate that DuckNet achieves a practical balance between accuracy and efficiency on real surveillance imagery. Although two-stage and transformer-based detectors can be competitive, they typically incur larger computational cost and model footprints, which constrains farm deployment. In contrast, DuckNet maintains reliable localization and fewer missed detections in crowded and blurred frames while keeping the model compact, aligning with the general direction of parameter efficient design for embedded inference emphasized by lightweight backbones such as MobileNet [11]. The design choice is also consistent with the insight of EfficientDet that carefully structured feature fusion can strengthen multi-scale representation without prohibitive cost [26]. In our setting, efficiency is achieved by targeted improvements in feature representation that stabilize recall under crowding, rather than by aggressively shrinking network capacity at the expense of detection continuity.
The ablation results explain why the final configuration was adopted. While MAFPN yields slightly higher values on certain static metrics, the improved neck configuration provides more stable recall and stronger robustness under occlusion and crowding, which dominate dense barn monitoring. This observation is consistent with the finding that partial occlusion of information-rich body regions can directly degrade recognition quality and that robustness under incomplete visibility is critical for practical use [8]. For DenseDuckMOT, recall stability is particularly important because it reduces downstream association ambiguity and prevents avoidable trajectory fragmentation. The combined ablation further indicates that the final gains arise from complementary contributions of multi scale fusion and context enhancement, rather than from any single isolated module.
On the tracking side, AKFTrack shows stronger trajectory continuity than representative baselines in dense and homogeneous waterfowl scenes. In our evaluation sequences, the maximum number of ducks visible and tracked simultaneously reached 24, reflecting the practical capacity within a single pen under routine farming conditions. DeepSORT and StrongSORT rely heavily on appearance embeddings and re-identification cues, yet, in dense duck flocks, these cues are often weak because individuals share similar textures and colors and are frequently partially occluded. This makes appearance-driven association less reliable and increases identity switches and trajectory breaks. AKFTrack mitigates this limitation by strengthening motion-guided prediction through adaptive Kalman filtering and by adopting a two-stage association strategy that better utilizes uncertain detections. The observed improvements in standard tracking metrics and qualitative comparisons support the conclusion that motion-guided association and robust handling of low-confidence detections are more suitable than purely appearance-driven association for dense and homogeneous duck house monitoring.
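The idea of a two-stage association that reuses uncertain detections can be illustrated with a ByteTrack-style sketch. This is not the AKFTrack implementation. It assumes the predicted boxes come from some motion model (for example a Kalman filter), uses greedy IoU matching instead of an optimal assignment, and the thresholds `high_thr`, `low_thr`, and `min_iou` are hypothetical values chosen for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks: Dict[int, Box], dets: List[Tuple[int, Box]],
                 min_iou: float) -> Dict[int, int]:
    """Pair track ids with detection indices in order of descending IoU."""
    pairs = sorted(
        ((iou(tb, db), tid, di) for tid, tb in tracks.items() for di, db in dets),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used_dets = set()
    for score, tid, di in pairs:
        if score < min_iou:
            break  # all remaining pairs overlap even less
        if tid in matches or di in used_dets:
            continue
        matches[tid] = di
        used_dets.add(di)
    return matches


def two_stage_associate(predicted: Dict[int, Box],
                        detections: List[Tuple[Box, float]],
                        high_thr: float = 0.5, low_thr: float = 0.1,
                        min_iou: float = 0.3) -> Dict[int, int]:
    """Return track_id -> detection index; thresholds are illustrative."""
    high = [(i, b) for i, (b, s) in enumerate(detections) if s >= high_thr]
    low = [(i, b) for i, (b, s) in enumerate(detections) if low_thr <= s < high_thr]
    # Stage 1: confident detections are matched against all predicted tracks.
    matches = greedy_match(predicted, high, min_iou)
    # Stage 2: tracks left unmatched may still claim a low-confidence detection.
    remaining = {t: b for t, b in predicted.items() if t not in matches}
    matches.update(greedy_match(remaining, low, min_iou))
    return matches


# Hypothetical frame: track 2 is partly occluded, so its detection scores only 0.2.
predicted = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [((1, 1, 11, 11), 0.9), ((21, 20, 31, 30), 0.2), ((50, 50, 60, 60), 0.05)]
matches = two_stage_associate(predicted, detections)
print(matches)
```

In this toy frame, a single-stage tracker that discards everything below `high_thr` would leave track 2 unmatched. The second stage keeps it alive by pairing it with the low-confidence box, while the 0.05-score detection is still ignored.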
Regarding real-time feasibility and deployment, DuckNet is not the fastest among extremely compact baselines because additional computation is introduced by feature fusion for robustness under occlusion and blur, and the shared head design does not always translate into peak throughput without operator level optimization. Nevertheless, the measured speed remains sufficient for typical surveillance streams recorded at 25 frames per second in our farm setting. On our workstation, DuckNet reaches 303.4 frames per second under the same input setting, corresponding to about 3.3 ms per frame for detection, which comfortably supports the real-time processing of a standard stream. For deployment, Raspberry Pi class devices can be feasible for reduced resolution or reduced framerate monitoring, but full-resolution real-time inference in crowded barns generally requires an edge device with sufficient computation, ideally equipped with GPU or NPU acceleration. Since the framework relies on the fixed cameras already commonly installed in barns, practical adoption mainly requires a local inference unit and standard video storage and networking, rather than intrusive sensors or animal-worn devices, which keeps deployment costs and operational disruption low.
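The real-time argument above reduces to a per-frame time budget. The short sketch below reproduces that arithmetic from the two reported figures (303.4 frames per second for DuckNet on the workstation, 25 frames per second for the surveillance stream). The helper name `frame_latency_ms` is illustrative, and the calculation covers detection only.

```python
def frame_latency_ms(fps: float) -> float:
    """Convert a throughput in frames per second to milliseconds per frame."""
    return 1000.0 / fps


detector_fps = 303.4  # reported DuckNet throughput on the workstation
stream_fps = 25.0     # reported surveillance stream rate

detect_ms = frame_latency_ms(detector_fps)  # time spent on detection per frame
budget_ms = frame_latency_ms(stream_fps)    # time available per frame
headroom_ms = budget_ms - detect_ms         # left for decoding, tracking, I/O

print(f"detection: {detect_ms:.1f} ms, budget: {budget_ms:.1f} ms, "
      f"headroom: {headroom_ms:.1f} ms")
```

The remaining headroom has to absorb video decoding, tracking, and any I/O. On slower edge hardware, the same calculation can be repeated with the measured throughput of that device.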
5. Conclusions
This study proposes a coupled detection and tracking framework called DenseDuckMOT, specifically designed for high-density duck farms, enabling continuous contactless monitoring using fixed surveillance cameras. Key empirical findings indicate that, in real-world duck farms, the bottleneck is rarely the ability to detect ducks in a single image. Instead, the challenge lies in consistently monitoring each duck despite repeated interference from occlusion, overlap, and motion blur. For farm-scale tasks such as duck counting and routine inspections, maintaining trajectory continuity and minimizing identity confusion are more important than maximizing peak processing speed under simplified conditions.
DenseDuckMOT addresses this need with a collaborative design. DuckNet is designed to remain compact while enhancing robustness in crowded scenes, ensuring stable detection results even when ducks are densely clustered or partially occluded. Building upon this, AKFTrack enhances temporal continuity using an adaptive motion prediction and a two-stage association strategy, reducing trajectory fragmentation when ducks temporarily disappear and re-enter the field of view. These components together form an end-to-end monitoring process that enables reliable camera-based monitoring without the need for implanted devices or invasive sensors on the animals. This reduces labor-intensive manual counting and mitigates the potential stress and biosecurity risks associated with repeated human entry into the duck house.
The method was validated on a commercial farm using 120-day-old Liancheng White ducks and tested under standard duck house conditions in summer and winter. We evaluated multi-object tracking using five representative monitoring video clips, demonstrating that the proposed framework is suitable for crowded and highly interactive real-world scenarios. Because the system relies on the fixed cameras common in modern duck houses, its deployment primarily requires only a local inference unit, standard video storage, and networking, enabling scalable deployment with minimal additional hardware investment.
This study is limited to a single farm, a single breed, a single age group, and a limited set of camera views. Performance may still degrade under conditions of extremely low light, prolonged severe occlusion, or compressed inter-animal spacing that restricts effective separation. Future work will expand the validation scope to include different farms, ages, stocking densities, and pen layouts, and extend the monitoring range to the entire growth cycle, thus clearly covering body size changes from ducklings to adults. We will also evaluate the system’s adaptability to other high-density poultry species with similar husbandry conditions and visual challenges, such as broilers, laying hens, turkeys, and geese, and further investigate its applicability to densely stocked small ruminants and swine in pen-based housing, including piglets and finishing pigs, as well as group-housed sheep. In addition, we will further optimize edge deployment using model compression and hardware acceleration, enabling embedded platforms with sufficient computing power to reliably run the framework in real-time.
In summary, DenseDuckMOT provides a practical and scalable foundation for welfare-friendly, low-cost visual monitoring, supporting safer, more efficient, and more sustainable duck farm management.
References
1. Anderson S. Animal Genetic Resources and Sustainable Livelihoods. Ecol. Econ. 2003, 45, 331-339. doi:10.1016/S0921-8009(03)00088-0
2. Zhang Y., Wang L., Bian Y., Wang Z., Xu Q., Chang G., Chen G. Marginal Diversity Analysis of Conservation of Chinese Domestic Duck Breeds. Sci. Rep. 2019, 9, 13141. doi:10.1038/s41598-019-49652-6. PMID: 31511604. PMCID: PMC6739371
3. Neethirajan S. The Role of Sensors, Big Data and Machine Learning in Modern Animal Farming. Sens. Bio-Sens. Res. 2020, 29, 100367. doi:10.1016/j.sbsr.2020.100367
4. Mitin H., Idrus Z., Meng G.Y., Sazili A.Q., Awad E.A. Effects of Positive Human Contact on Fear and Physiological Stress Responses in Pekin Ducks (Anas Platyrhynchos Domesticus) Subjected to Crating and Transport. Appl. Anim. Behav. Sci. 2023, 269, 106108. doi:10.1016/j.applanim.2023.106108
5. Bao J., Xie Q. Artificial Intelligence in Animal Farming: A Systematic Literature Review. J. Clean. Prod. 2022, 331, 129956. doi:10.1016/j.jclepro.2021.129956
6. Guo Y., Wang J., Lin P., Yin C., Han Y. Multiple Behaviour Recognition of Free-Range Broilers in Cross-Domain Scenarios Using MCA-YOLOv5. Biosyst. Eng. 2025, 257, 104226. doi:10.1016/j.biosystemseng.2025.104226
7. Nasiri A., Yoder J., Zhao Y., Hawkins S., Prado M., Gan H. Pose Estimation-Based Lameness Recognition in Broiler Using CNN-LSTM Network. Comput. Electron. Agric. 2022, 197, 106931. doi:10.1016/j.compag.2022.106931
8. Zhao S., Bai Z., Meng L., Han G., Duan E. Pose Estimation and Behavior Classification of Jinling White Duck Based on Improved HRNet. Animals 2023, 13, 2878. doi:10.3390/ani13182878. PMID: 37760278. PMCID: PMC10525901
