Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics
Haiyu Yang, Miel Hostens

TL;DR
This paper presents a lightweight, distilled vision model pipeline for edge-deployable livestock monitoring that maintains high accuracy while significantly reducing memory and computational requirements.
Contribution
It introduces a novel distillation approach for SAM 3 and DINOv3 models, enabling efficient on-device livestock monitoring and visual analytics.
Findings
Achieves 92.29% MOTA on the Edinburgh Pig dataset
Reduces VRAM usage 3-fold, enabling deployment on the NVIDIA Jetson Orin NX
Maintains 97.34% top-1 classification accuracy
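For readers unfamiliar with the tracking metric quoted above, MOTA (Multiple Object Tracking Accuracy) is conventionally defined under the CLEAR MOT metrics as 1 - (FN + FP + IDSW) / GT, with all counts summed over every frame. The sketch below only illustrates that formula. The function name and the counts are hypothetical and are not taken from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


# Hypothetical counts chosen only to illustrate the formula; not results from the paper.
score = mota(false_negatives=30, false_positives=15, id_switches=5, num_gt=500)
print(f"MOTA = {score:.2%}")
```

Because the error terms are summed before dividing, MOTA can go negative when the tracker makes more errors than there are ground-truth objects.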
Abstract
Foundation-model pipelines for individual-level livestock monitoring -- combining open-vocabulary detection, promptable video segmentation, and self-supervised visual embeddings -- have raised the accuracy ceiling of precision livestock farming (PLF), but their GPU memory budgets exceed the envelope of commodity edge accelerators. To close this gap, the 446M-parameter Perception Encoder (PE-ViT-L+) backbone of SAM 3 is distilled into a 40.66M-parameter multi-scale student through three mechanisms: a Feature Pyramid Network student encoder built on TinyViT-21M-512, a four-term direction-then-scale distillation loss, and backbone-substitution inference with sliding-window session pruning that bounds streaming GPU memory growth. The DINOv3 family includes a pre-distilled ViT-S/16 variant (21.6M parameters) released alongside a 6716M-parameter ViT-7B teacher; the ViT-S (21M) variant is…
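As a rough sanity check on the scale of compression, the parameter counts quoted in the abstract can be turned into teacher-to-student ratios. This is a minimal sketch using only the numbers stated above; the helper name is hypothetical, and parameter count is only a proxy for memory use, not the measured VRAM reduction.

```python
def compression_ratio(teacher_params_m, student_params_m):
    """Return how many times larger the teacher is than the student (both in millions)."""
    return teacher_params_m / student_params_m


# Parameter counts (in millions) as quoted in the abstract.
sam3_ratio = compression_ratio(446.0, 40.66)    # PE-ViT-L+ backbone vs. multi-scale student
dinov3_ratio = compression_ratio(6716.0, 21.6)  # ViT-7B teacher vs. ViT-S/16 variant

print(f"SAM 3 backbone: about {sam3_ratio:.1f}x fewer parameters")
print(f"DINOv3: about {dinov3_ratio:.0f}x fewer parameters")
```

The SAM 3 backbone student is roughly 11 times smaller than its teacher, while the DINOv3 ViT-S/16 variant is roughly 311 times smaller than ViT-7B.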
