DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications

Malaisree P; Youwai S; Kitkobsin T; Janrungautai S; Amorndechaphon D; Rojanavasu P

arXiv:2510.25140 · cs.CV · November 3, 2025

TL;DR

DINO-YOLO is a hybrid self-supervised object detection architecture that significantly improves accuracy in civil engineering applications with limited data, while maintaining real-time inference speeds.

Contribution

The paper introduces DINO-YOLO, combining YOLOv12 with DINOv3 transformers, and demonstrates its effectiveness in data-efficient detection for civil engineering tasks.

Findings

01

Achieves up to an 88.6% improvement in detection accuracy, observed on the KITTI dataset (7K images).

02

Maintains real-time inference at 30-47 FPS despite added complexity.

03

Medium-scale architectures perform best with dual integration points (DINOv3 features injected at P0 and P3).
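
To put the reported 30-47 FPS range in per-frame terms, the frame rate can be converted into a time budget per image. This is a minimal arithmetic sketch, not code from the paper; the helper name `frame_budget_ms` is hypothetical.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the per-frame time budget in milliseconds for a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Endpoints of the real-time range reported in the findings.
for fps in (30, 47):
    print(f"{fps} FPS -> {frame_budget_ms(fps):.1f} ms per frame")
# prints:
# 30 FPS -> 33.3 ms per frame
# 47 FPS -> 21.3 ms per frame
```

Under this reading, the full pipeline (DINOv3 feature extraction plus YOLO detection) fits in roughly 21 to 33 ms per frame on the hardware the authors measured.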

Abstract

Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The…
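
The abstract quotes improvements as percentages without stating, in the portion shown here, whether they are relative gains or absolute percentage-point differences. The sketch below shows how the two readings differ. It is illustrative only: the function names are hypothetical and the example scores are made-up placeholders, not results from the paper.

```python
def relative_improvement_pct(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


def absolute_gain_points(baseline: float, improved: float) -> float:
    """Absolute difference between two scores given in percent (percentage points)."""
    return improved - baseline


# Hypothetical mAP scores in percent (NOT taken from the paper).
baseline_map, improved_map = 40.0, 45.0

print(f"relative: {relative_improvement_pct(baseline_map, improved_map):.1f}%")
print(f"absolute: {absolute_gain_points(baseline_map, improved_map):.1f} points")
# prints:
# relative: 12.5%
# absolute: 5.0 points
```

The same pair of scores yields a 12.5% relative gain but only a 5.0-point absolute gain, so a large figure such as 88.6% is most plausibly a relative gain over a low baseline; the full paper should be consulted to confirm which convention the authors use.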
