Supervised Virtual-to-Real Domain Adaptation for Object Detection Task using YOLO

Akbar Satya Nugraha; Yudistira Novanto; Bayu Rahayudi

arXiv:2302.13891·cs.CV·February 28, 2023


TL;DR

This paper explores supervised domain adaptation from virtual to real data for object detection using YOLOv4, achieving high accuracy with limited real data by fine-tuning on backbone weights.

Contribution

It introduces a domain adaptation approach using virtual datasets and fine-tuning YOLOv4's backbone to improve real-world object detection performance.

Findings

01. Achieved 74.457% mAP with limited real data

02. Fine-tuning backbone weights enhances domain adaptation

03. Virtual datasets can effectively supplement real data


Tables

Table 1: mAP results for all testing schemes

Scheme	Total Sample Data	mAP (%)
YR	220	0
YVR	5000	27.251
YVR	10000	51.369
YCVR	5000	65.513
YCVR	10000	72.264
YCVR	20000	59.691
YCSVR	5000	74.457
YCSVR	10000	72.096
YCSVR	20000	73.369
YCMVR	5000	55.010
YCMVR	10000	54.368
YCMSVR	5000	59.977
YCMSVR	10000	53.788

Table 2: Average precision of each class using the best scheme

Class	AP (%)
Head	84.052
Helmet	93.691
Ear Protection	42.292
Welding Mask	86.364
Bare Chest	59.159
High Visibility Vest	87.637
Person	51.457

Equations

\alpha = \frac{\upsilon}{(1 - IoU) + \upsilon} \tag{1}

\upsilon = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2} \tag{2}

L_{CIoU} = 1 - IoU + \frac{\rho(b, b^{gt})}{c^{2}} + \alpha\upsilon \tag{3}

\begin{aligned}
L_{total} = {} & L_{CIoU} \\
& - \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \hat{C}_{i} \log(C_{i}) + (1 - \hat{C}_{i}) \log(1 - C_{i}) \right] \\
& - \lambda_{noobj} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} I_{ij}^{noobj} \left[ \hat{C}_{i} \log(C_{i}) + (1 - \hat{C}_{i}) \log(1 - C_{i}) \right] \\
& - \sum_{i=0}^{S^{2}} I_{ij}^{obj} \sum_{c \in classes} \left[ \hat{p}_{i}(c) \log(p_{i}(c)) + (1 - \hat{p}_{i}(c)) \log(1 - p_{i}(c)) \right]
\end{aligned} \tag{4}

Taxonomy

Topics: Advanced Neural Network Applications · COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning

Methods: Feature Pyramid Network · Grid Sensitive · Tanh Activation · Spatial Pyramid Pooling · Label Smoothing · Sigmoid Activation · Logistic Regression

Full text

Supervised Virtual-to-Real Domain Adaptation for Object Detection Task using YOLO

Abstract

Deep neural networks perform well in many real-world tasks, and object detection is one of them. Model accuracy depends on well-annotated datasets, and accuracy generally improves as the network learns from more data. However, well-annotated datasets are hard to find, especially in a specific domain. To overcome this, computer-generated data, or virtual datasets, are used: researchers can generate many images for a specific use case together with their annotations. Previous studies have shown that virtual datasets can be used for object detection tasks. Nevertheless, a model trained on a virtual dataset must adapt to real data; that is, it must have domain adaptation capability. We explored domain adaptation inside an object detection model, using a virtual dataset to compensate for the scarcity of well-annotated real data. We use the VW-PPE dataset, with 5000 and 10000 virtual images and 220 real images. For the model architecture, we used YOLOv4 with CSPDarknet53 as the backbone and PAN as the neck. The domain adaptation technique that fine-tunes only the backbone weights achieved a mean average precision of 74.457%.

**Index Terms:** YOLOv4, Object Detection, Virtual Dataset, Domain Adaptation, Personal Protective Equipment

1 Introduction

In the new spring of artificial intelligence, and more specifically in its subfield known as machine learning, a significant number of notable results have shown that machine learning is viable for specific human tasks, such as object detection and object classification [1]. However, these results also depend on the availability of large amounts of real data and their labels.

In the era of big data, real input data for training machine learning algorithms is relatively easy to obtain for a wide range of applications. Several other fields, however, still lack training data. Even when data is available, it must be manually revised to make it usable.

Making a dataset usable for training is complicated and requires technical knowledge, especially for object detection datasets. This is because object detection requires an object anchor as the label in every image. Training an anchor-based object detector with a sparsely annotated dataset can cause performance degradation [2].

Problems such as the limited availability of datasets and the taxing process of revising data to make it usable force researchers to look for other methods. One such method is to pre-train a model on synthetic, computer-generated, or virtual datasets. Virtual datasets have been on the rise because they offer abundant data scenarios with correct labels at a lower cost.

Using a virtual dataset comes with a downside: cross-domain shift. Cross-domain object detection is challenging due to multi-level domain shift in an unseen domain [3]. Previous research has proposed several methods for handling cross-domain shift when using virtual data, ranging from adding a domain-adapting layer [4] to creating a hierarchical domain-consistent network [3].

This research investigates a domain adaptation strategy that maximizes the utilization of the virtual domain in the real-world domain, so that an object detection model needs less real-world data. Specifically, we demonstrate how a transfer learning approach on a well-known deep neural network can achieve state-of-the-art results in automatic visual media indexing after being trained with virtually generated images of people wearing safety gear, such as high-visibility jackets and helmets, followed by domain adaptation using a few real training images.

2 Related Work

Object detection technologies now achieve accuracies and speeds that were unimaginable a few years ago. Currently, YOLO [5] [6] [7] and R-CNN [8] are the de facto standards for object detection tasks. Most object detection research relies on huge generic annotated datasets, such as Pascal [9], ImageNet [10], MS COCO [11], or OpenImages [12]. These datasets collect a large number of manually annotated web images.

Because reliable accuracy requires huge amounts of data, computer-generated, or virtual, datasets have gained significant interest. The use of virtual datasets began with research on pedestrian detection, which showed promising results with less than 2% derivation rate for detecting pedestrians [13]. A virtual dataset was also used to study trained CNNs in order to analyze deep features qualitatively and quantitatively [14].

Data generated from games has also been explored in several studies. In [15], a CNN trained on 50000 labeled images from the GTA-V game achieved a considerably small mean squared error for lane distance estimation using only a virtual dataset. In [16], an R-CNN model could detect a sofa from different viewpoints using only a dataset generated with Unreal Engine 4.

A dataset generated from GTA-V demonstrated that it is possible to reach excellent results on tasks such as real people tracking and pose estimation [17]. Training Faster R-CNN on virtual datasets and validating the result on the KITTI dataset also shows good results [18]. Virtual datasets can also be used to train a simple convolutional network to detect objects belonging to various classes in video [19].

Object detection models can also use virtual datasets to achieve better accuracy. For example, in [15], adapting from the virtual dataset SIM 10k to the real dataset Cityscapes for car detection resulted in an average precision of 51.6%. Using 140,000 virtual images and just 220 real images resulted in an object detection model that could detect Personal Protective Equipment (PPE) with 76% accuracy [20]. Based on the research above, using a virtual dataset can produce a more accurate model.

3 Methodology

3.1 Virtual Data

We used the VW-PPE dataset, which contains over 140,000 virtual images and 220 real images. The virtual images were generated using RAGE, the game engine of GTA-V; each virtual image is 1088 pixels wide and 612 pixels high, whereas the real images vary in width and height. The VW-PPE dataset has seven object classes: Bare Head, Helmet, Ear Protection, Welding Mask, Bare Chest, High Visibility Vest, and Person. The virtual images were generated in 10 locations of the game map, with three weather and time variations for each location. From the 140,000 virtual images, this research used only random samples of 5000 and 10000 images. The real images are split 50:50 into training and testing sets. Sample images from the VW-PPE dataset are shown in Fig. 1.
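The sampling and splitting step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the image identifiers, the function names `sample_virtual` and `split_real`, and the fixed seed are all hypothetical. The sampling here is uniform, so it does not preserve class ratios.

```python
import random


def sample_virtual(image_ids, k, seed=0):
    """Draw k distinct virtual images uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(image_ids, k)


def split_real(image_ids, seed=0):
    """Shuffle the real images and split them 50:50 into train and test."""
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Placeholder identifiers standing in for the 140,000 virtual and 220 real images.
virtual_ids = [f"virtual_{i:06d}" for i in range(140000)]
real_ids = [f"real_{i:03d}" for i in range(220)]

virtual_5k = sample_virtual(virtual_ids, 5000)
virtual_10k = sample_virtual(virtual_ids, 10000)
real_train, real_test = split_real(real_ids)
```

With 220 real images, this leaves 110 images for fine-tuning and 110 for testing.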

3.2 YOLO

For object detection we used You Only Look Once (YOLO), a one-stage detector that performs localization and classification in a single stage. Specifically, we used YOLOv4, chosen because of the custom components that can be used with it. The YOLOv4 architecture uses CSPDarknet53 [21] as the backbone, PAN [22] as the neck, and the YOLOv3 detector layer [1] as the head.

To evaluate the performance of our implementation, we used Intersection over Union (IoU), computed from the areas of the detected (D) and real (V) bounding boxes, as well as Precision (Pr) and Recall (Rc). The confidence score associated with each detected bounding box ranges from 0 to 1, and a box is included in the output only if its confidence score exceeds a user-defined threshold. Given these criteria, the mean Average Precision (mAP) is calculated as the average of the highest precision at various recall settings.
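As an illustration of these metrics, the sketch below computes IoU for two axis-aligned boxes and precision and recall from detection counts. The corner-based box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for this example; the source does not specify them.

```python
def iou(box_d, box_v):
    """IoU of a detected box D and a real box V, each (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_d[2], box_v[2]) - max(box_d[0], box_v[0]))
    inter_h = max(0.0, min(box_d[3], box_v[3]) - max(box_d[1], box_v[1]))
    inter = inter_w * inter_h
    area_d = (box_d[2] - box_d[0]) * (box_d[3] - box_d[1])
    area_v = (box_v[2] - box_v[0]) * (box_v[3] - box_v[1])
    union = area_d + area_v - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
pr, rc = precision_recall(tp=8, fp=2, fn=2)
```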

3.3 Loss Function

To achieve robust detection, we used the YOLOv4 loss function. Its first component is the Complete Intersection over Union (CIoU) loss, which computes the loss from the x and y coordinates, the width, and the height of the bounding boxes [23].

[TABLE]

The CIoU formula contains two variables: $\alpha$, a positive trade-off parameter, given in Equation 1, and $\upsilon$, which measures the consistency of the aspect ratio, given in Equation 2. The formula for $L_{CIoU}$ is given in Equation 3.
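A minimal sketch of the CIoU loss follows. It uses the commonly published form of the definition, in which the center-distance term is the squared Euclidean distance between box centers divided by the squared diagonal of the smallest enclosing box. The center-based box format `(cx, cy, w, h)`, the function name `ciou_loss`, and the small `eps` stabilizer are assumptions for this example, and box widths and heights are assumed positive.

```python
import math


def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss for two boxes given as (cx, cy, w, h) with positive w and h."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # Convert to corner coordinates.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2

    # Plain IoU.
    inter_w = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    inter_h = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # Squared center distance (rho^2) and squared enclosing-box diagonal (c^2).
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    c_w = max(p_x2, g_x2) - min(p_x1, g_x1)
    c_h = max(p_y2, g_y2) - min(p_y1, g_y1)
    c2 = c_w ** 2 + c_h ** 2 + eps

    # Aspect-ratio consistency (upsilon) and trade-off weight (alpha).
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


same = ciou_loss((5, 5, 2, 2), (5, 5, 2, 2))      # identical boxes
apart = ciou_loss((0, 0, 2, 2), (10, 10, 2, 2))   # disjoint boxes
```

For identical boxes the loss is close to zero, while for disjoint boxes it exceeds 1 because the center-distance penalty is added on top of `1 - IoU`.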

[TABLE]

In Equation 4, the second and third components are the confidence losses for objectness in every grid cell. The variables $I_{ij}^{obj}$ and $I_{ij}^{noobj}$ indicate the presence and absence of an object in the grid cell, respectively: $I_{ij}^{obj}$ is 1 if there is an object in the grid cell and 0 otherwise, and $I_{ij}^{noobj}$ is 1 if there is no object in the grid cell and 0 otherwise. The variables $\hat{C}_{i}$ and $C_{i}$ are the ground-truth and predicted confidence scores of whether there is an object or not, respectively. In the last component, $\hat{p}_{i}$ and $p_{i}$ are the actual and predicted class probabilities, respectively, for the classification loss.
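The confidence and classification terms are binary cross-entropy sums masked by object presence. The sketch below is one way to write them, under stated assumptions: the per-cell data layout, the function names, and the default `lambda_noobj=0.5` are illustrative choices, not values given in the source.

```python
import math


def bce(target, pred, eps=1e-7):
    """Binary cross-entropy for one target/prediction pair, clamped for stability."""
    pred = min(max(pred, eps), 1 - eps)
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))


def confidence_loss(cells, lambda_noobj=0.5):
    """cells: list of (has_object, target_conf, pred_conf), one entry per box slot.

    Slots with an object contribute fully; slots without one are
    down-weighted by lambda_noobj (an assumed value here).
    """
    total = 0.0
    for has_object, target_conf, pred_conf in cells:
        weight = 1.0 if has_object else lambda_noobj
        total += weight * bce(target_conf, pred_conf)
    return total


def class_loss(object_cells):
    """object_cells: list of (target_probs, pred_probs) for cells containing objects."""
    return sum(
        bce(t, p)
        for target_probs, pred_probs in object_cells
        for t, p in zip(target_probs, pred_probs)
    )


# One slot with an object and one without, both predicted at 0.5 confidence.
conf = confidence_loss([(True, 1.0, 0.5), (False, 0.0, 0.5)])
cls = class_loss([([1.0, 0.0], [0.5, 0.5])])
```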

3.4 Domain adaptation

We propose domain adaptation by pre-training on virtual datasets to address the cross-domain shift problem during fine-tuning. Specifically, we apply the domain adaptation method to adapt a pre-trained YOLO to our case. Our premise is that a pre-trained network contains sufficient knowledge for us to specialize it for a task, using the transfer learning capabilities of deep neural networks and training sets generated from the virtual world.

The objective of transfer learning is to reuse the first, already trained layers (i.e., those identifying low-level features) and update the final layers of the network to extend the detection capabilities to the new set of objects. In a trained deep convolutional neural network, the first layers have learned to identify increasingly complex features.

In this experiment, we used a domain adaptation scheme based on SHOT (Source Hypothesis Transfer) [4], illustrated in Fig. 3. To address the domain shift problem, we implemented the SHOT domain adaptation scheme, in which the last layer of the YOLO architecture, used for detecting bounding boxes, is frozen. In addition to the frozen weights of the detection layer, we transfer the weights of the backbone and neck.
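The freezing step can be sketched independently of any framework as a choice of which parameters stay trainable during fine-tuning on real images. The parameter names and the `backbone.` / `neck.` / `head.` prefixes below are hypothetical. In a framework such as PyTorch, the resulting flag would typically be applied through each parameter's `requires_grad` attribute.

```python
def select_trainable(param_names, frozen_prefixes=("head.",)):
    """Map each parameter name to True (fine-tuned) or False (frozen).

    By default only the detection head is frozen, so the transferred
    backbone and neck weights are updated on the real images.
    """
    frozen = tuple(frozen_prefixes)
    return {name: not name.startswith(frozen) for name in param_names}


# Hypothetical parameter names for a YOLO-style model.
params = [
    "backbone.conv1.weight",
    "backbone.csp1.weight",
    "neck.pan1.weight",
    "head.detect.weight",
]

trainable = select_trainable(params)

# Variant that fine-tunes only the backbone: freeze the neck as well.
backbone_only = select_trainable(params, frozen_prefixes=("neck.", "head."))
```

Exposing the frozen prefixes as a parameter makes it easy to compare freezing only the head with fine-tuning only the backbone.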

4 Experiments

The training schemes are illustrated in Fig. 2. We trained six schemes, listed below:

• Training from scratch using real dataset only (YR)

• Transfer learning from scratch (YVR)

• Transfer learning with pre-trained weight (YCVR)

• Transfer learning with domain adaptation scheme (YCSVR)

• Transfer learning with mosaic augmentation and pre-trained weight (YCMVR)

• Transfer learning with only backbone weight and mosaic augmentation (YCMSVR)

Based on Table 1, YR obtains an mAP of 0, since no detections reached the confidence threshold. With 5000 virtual samples, the mAP for YVR reaches 27.251, and with 10000 samples it reaches 51.369. These results show that using virtual datasets as the source domain before transferring to real-world datasets is a promising strategy for boosting mAP in object detection tasks.

YCVR outperforms YVR: the mAP is 65.515 with 5000 samples and 72.264 with 10000 samples. Based on this finding, fine-tuning pre-trained weights, even across domains, increases the mAP of the object detection model.

With 5000 virtual samples, YCSVR achieves the best mAP of 74.457; with 10,000 virtual samples, the mAP is 72.096. These findings suggest that transfer learning with the SHOT domain adaptation scheme increases mAP, but that it struggles when the proportion of virtual-domain data is considerably higher than that of real-domain data.

Lastly, the YCMVR and YCMSVR results show that mosaic augmentation decreases mAP. All YCMVR and YCMSVR tests yield an mAP between 53.788 and 59.977, which is lower than YCVR at the same sample sizes.

Table 1 shows that the mAP of YCSVR with 10,000 training samples is lower than with 5,000. This is because the sampling process is random: although the class distribution keeps the same ratio, the images themselves differ. Fig. 4 shows that the real dataset is brighter than the two samples drawn from the virtual dataset. The issue with randomly sampling the virtual dataset is that the average color histogram of each sample is darker than that of the real dataset. Therefore, random sampling can contribute to the domain shift problem by making the virtual samples darker than the real dataset.

Table 2 reports the average precision of every class using the best scheme, YCSVR. The helmet class has the highest average precision, and the ear protection class has the lowest. This is because the helmet class has the most labels in the dataset, while ear protection has the fewest.
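As a small worked example, the per-class values from Table 2 can be averaged and ranked as below. The dictionary simply restates the table, and the variable names are illustrative. The unweighted mean of the seven values is about 72.09. The source does not state which YCSVR sample size Table 2 was produced with, so no match with a specific row of Table 1 is assumed here.

```python
# Per-class average precision (%) as listed in Table 2.
class_ap = {
    "Head": 84.052,
    "Helmet": 93.691,
    "Ear Protection": 42.292,
    "Welding Mask": 86.364,
    "Bare Chest": 59.159,
    "High Visibility Vest": 87.637,
    "Person": 51.457,
}

# mAP as the unweighted mean of the class APs.
mean_ap = sum(class_ap.values()) / len(class_ap)

best_class = max(class_ap, key=class_ap.get)
worst_class = min(class_ap, key=class_ap.get)

print(f"mean AP: {mean_ap:.3f}, best: {best_class}, worst: {worst_class}")
```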

The YCSVR model gives promising results for both bounding box prediction and classification, as shown in Fig. 5.

5 Conclusion

Training a deep neural network in virtual environments has been shown to help when the amount of available and usable training data is low. In this paper, we performed personal protective equipment detection with only a few real images. In our experiment, we trained YOLOv4 on the virtual dataset and tested it on a real dataset. In addition, we fine-tuned the deep neural network with a small amount of real data. Based on the experiment, we found that transfer learning with only the backbone weights performs better than normal transfer learning. Moreover, we found that mosaic augmentation lowered mAP in cross-domain object detection training, so other augmentation choices are preferable.

Bibliography


[1] Joseph Redmon and Ali Farhadi, “YOLOv3: An incremental improvement,” 2018.
[2] Jihun Yoon, Seungbum Hong, and Min-Kook Choi, “Semi-supervised object detection with sparsely annotated dataset,” in 2021 IEEE International Conference on Image Processing (ICIP), 2021, pp. 719–723.
[3] Yuanyuan Liu, Ziyang Liu, Fang Fang, Zhanghua Fu, and Zhanlong Chen, “Hierarchical domain-consistent network for cross-domain object detection,” in 2021 IEEE International Conference on Image Processing (ICIP), 2021, pp. 474–478.
[4] Jian Liang, Dapeng Hu, and Jiashi Feng, “Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation,” 2020.
[5] Joseph Redmon and Ali Farhadi, “YOLO9000: Better, faster, stronger,” 2016.
[6] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao, “YOLOv4: Optimal speed and accuracy of object detection,” CoRR, vol. abs/2004.10934, 2020.
[7] Joseph Redmon and Ali Farhadi, “YOLOv3: An incremental improvement,” 2018.
[8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” 2015.