# VM-RTDETR: Advancing DETR with Vision State-Space Duality and Multi-Scale Fusion for Robust Pig Detection

**Authors:** Wangli Hao, Shu-Ai Xu, Hao Shu, Hanwei Li, Meng Han, Fuzhong Li, Yanhong Liu

PMC · DOI: 10.3390/ani15223328 · 2025-11-18

## TL;DR

This paper introduces VM-RTDETR, an RT-DETR-based object detector that improves pig detection in farm environments by combining global context with multi-scale local features.

## Contribution

VM-RTDETR introduces a Vision State-Space Duality backbone and a Multi-Scale Encoder for robust pig detection in complex environments.

## Key findings

- VM-RTDETR outperforms the R50-RTDETR baseline by up to 2.35% in average precision (AP) on a custom pig farm dataset of 8070 images, with gains of up to 0.63% in AP50 and 2.76% in AP75.
- The model effectively handles scale changes, occlusions, and complex backgrounds in livestock monitoring.
- The VSSD and M-Encoder combination achieves more comprehensive feature representation for detection.
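
The average-precision figures above depend on how a predicted box is matched to a ground-truth box. As a minimal sketch, assuming the COCO-style convention that the AP/AP50/AP75 naming suggests (a prediction counts as correct when its intersection over union, IoU, with a ground-truth box reaches 0.50 for AP50 or 0.75 for AP75), the matching test could look like the following. The function names and the example boxes are hypothetical and are not taken from the paper's evaluation code.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, truth, iou_threshold):
    """A prediction matches when its IoU reaches the chosen threshold."""
    return box_iou(pred, truth) >= iou_threshold


# Hypothetical boxes: the prediction is shifted 2 units to the right.
truth = (0.0, 0.0, 10.0, 10.0)
pred = (2.0, 0.0, 12.0, 10.0)

iou = box_iou(pred, truth)                          # 80 / 120
match_at_50 = is_true_positive(pred, truth, 0.50)   # counted under AP50
match_at_75 = is_true_positive(pred, truth, 0.75)   # rejected under AP75
```

The same slightly misplaced box is accepted at the looser threshold and rejected at the stricter one, which is why AP75 is the more demanding localization measure.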

## Abstract

Robust pig detection in complex farming environments requires a unified representation of both global semantics and local details, which remains a challenge. This paper proposes VM-RTDETR, an enhanced RT-DETR (a transformer-based real-time object detector) that addresses this by combining a Vision State-Space Duality (VSSD) backbone with a Multi-scale Encoder (M-Encoder). The VSSD backbone removes the causal constraint of traditional state-space models, allowing it to efficiently capture long-range dependencies and global context within an image, while the M-Encoder extracts multi-scale features in parallel to handle appearance variations. Together they yield a detector that robustly handles scale changes, occlusions, and complex backgrounds. On a custom pig farm dataset of 8070 images, VM-RTDETR surpasses the strong R50-RTDETR baseline by up to 2.35% in AP, 0.63% in AP50, and 2.76% in AP75. It provides a reliable and efficient vision solution for automated livestock monitoring.
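
To make the causal constraint concrete, the toy sketch below contrasts a causal linear state-space recurrence, where each output sees only earlier tokens, with a non-causal readout in which every position reads one shared state built from all tokens. This is only an illustration under assumed scalar parameters: the function names, the parameter values, and the simple sum used as the global state are hypothetical and are not the paper's NC-SSD formulation.

```python
def causal_scan(x, a, b, c):
    """Causal recurrence: h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t."""
    h = 0.0
    ys = []
    for x_t, a_t, b_t, c_t in zip(x, a, b, c):
        h = a_t * h + b_t * x_t   # state depends only on tokens seen so far
        ys.append(c_t * h)
    return ys


def noncausal_readout(x, b, c):
    """Toy non-causal variant: one global state shared by every position."""
    global_state = sum(b_t * x_t for b_t, x_t in zip(b, x))
    return [c_t * global_state for c_t in c]


# Hypothetical inputs; x_edit changes only the LAST token.
x = [1.0, 2.0, 3.0]
x_edit = [1.0, 2.0, 30.0]
a = [0.5, 0.5, 0.5]
b = [1.0, 1.0, 1.0]
c = [1.0, 1.0, 1.0]

causal_first = (causal_scan(x, a, b, c)[0], causal_scan(x_edit, a, b, c)[0])
global_first = (noncausal_readout(x, b, c)[0], noncausal_readout(x_edit, b, c)[0])
```

Editing the last token leaves the first causal output unchanged but changes the first non-causal output. For image patches, which have no natural reading order, the second behavior is the one that lets every location draw on global context.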

Pig detection is a fundamental yet challenging task in intelligent livestock farming, primarily due to difficulties in capturing both global contextual information and multi-scale features within complex environments. To address this, we propose VM-RTDETR, a novel detection model based on an enhanced RT-DETR architecture. The model incorporates a Vision State-Space Duality (VSSD) backbone, leveraging a novel Non-Causal State-Space Duality (NC-SSD) mechanism to overcome the limitations of traditional SSMs by enabling efficient modeling of long-range dependencies and global context. Furthermore, we design a Multi-Scale Efficient Hybrid Encoder (M-Encoder) that employs parallel convolutional kernels to capture both local details and global contours, effectively addressing scale variations. The synergistic design of the VSSD backbone and the M-Encoder enables our model to achieve more comprehensive feature representation. Experimental results on a custom dataset of 8070 images from a pig farm (with 6955 images for training and 1115 for testing) demonstrate that VM-RTDETR significantly outperforms existing mainstream detectors, improving AP, AP50, and AP75 by up to 2.35%, 0.63%, and 2.76%, respectively, over the strong R50-RTDETR baseline. Our model significantly enhances detection robustness in complex scenarios, offering an efficient and accurate solution for intelligent livestock farming.
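
The idea of parallel convolutional kernels can be illustrated with a small sketch: the same input passes through several branches with different kernel sizes, and the branch outputs are fused. The version below is a one-dimensional, pure-Python toy under stated assumptions. The kernel sizes (1, 3, 5), the box filters standing in for learned weights, and the averaging fusion are all hypothetical choices for illustration; the abstract does not specify the M-Encoder's actual kernels or fusion rule.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; output length equals input length.

    Assumes an odd kernel length so the padding is symmetric.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def multi_scale_features(signal, kernel_sizes=(1, 3, 5)):
    """Run parallel branches with different receptive fields, then fuse them."""
    branches = []
    for size in kernel_sizes:
        kernel = [1.0 / size] * size  # box filter as a stand-in for learned weights
        branches.append(conv1d_same(signal, kernel))
    # Fuse by element-wise averaging (an assumption made for this sketch).
    return [sum(values) / len(values) for values in zip(*branches)]


# Hypothetical input: a single spike in the middle of the signal.
signal = [0.0, 0.0, 1.0, 0.0, 0.0]
features = multi_scale_features(signal)
```

The small kernel keeps the spike sharp (local detail), while the larger kernels spread its influence to neighboring positions (coarser context), so the fused output carries information at several scales at once.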

## Full-text entities

- **Species:** Sus scrofa (pig, species) [taxon 9823]

## Figures

16 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12649245/full.md

---
Source: https://tomesphere.com/paper/PMC12649245