GAOC: A Gaussian Adaptive Ochiai Loss for Bounding Box Regression
Binbin Han, Qiang Tang, Jiuxu Song, Zheng Wang, Yi Yang

TL;DR
This paper introduces GAOC, a new loss function for object detection that improves accuracy by addressing scale and drift issues in bounding box regression.
Contribution
The novel GAOC loss function combines the Ochiai Coefficient with a Gaussian Adaptive distribution to enhance bounding box regression.
Findings
GAOC outperforms existing BBR loss functions on PASCAL VOC and MS COCO benchmarks.
GAOC improves detection robustness and accuracy by reducing sensitivity to positional deviations.
The method is scale-invariant and effectively addresses drift in bounding box regression.
Abstract
Bounding box regression (BBR) loss plays a critical role in object detection within computer vision. Existing BBR loss functions are typically based on the Intersection over Union (IoU) between predicted and ground truth boxes. However, these methods neither account for the effect of predicted box scale on regression nor effectively address the drift problem inherent in BBR. To overcome these limitations, this paper introduces a novel BBR loss function, termed Gaussian Adaptive Ochiai BBR loss (GAOC), which combines the Ochiai Coefficient (OC) with a Gaussian Adaptive (GA) distribution. The OC component normalizes by the square root of the product of bounding box dimensions, ensuring scale invariance. Meanwhile, the GA distribution models the distance between the top-left and bottom-right corners (TL/BR) coordinates of predicted and ground truth boxes, enabling a similarity measure that…
- Shaanxi Natural Science Basic Research Program
- Xi’an Shiyou University, School of Electronic Engineering
1. Introduction
Object detection, a core computer vision task, is widely applied in autonomous driving, remote sensing, medical image analysis, and security surveillance [1,2]. Its objective is to classify and localize objects simultaneously. In this context, the bounding box regression (BBR) loss function is critical. It directly determines localization accuracy by measuring the alignment between predicted and ground truth boxes, thereby influencing overall detection performance. Recent advances in deep learning have prompted the development of numerous Intersection over Union (IoU)-based loss functions [3], including GIoU, DIoU, CIoU, and EIoU. These functions mitigate issues such as gradient vanishing for non-overlapping boxes, center-point distance, and aspect ratio mismatch, thereby driving continual improvements in detection accuracy.
However, prevailing IoU-based loss functions suffer from two principal limitations. First, they frequently ignore the influence of bounding box scale on the regression process, resulting in suboptimal performance for small objects and in long-tail distributions. Second, these methods are often ineffective at mitigating the BBR drift problem. This problem arises when a significant positional deviation between the predicted and ground truth boxes prevents the loss function from providing sufficient gradient constraints. This inadequacy can cause slow convergence during initial training or unstable localization in later phases. Consequently, the development of a BBR loss function that ensures both scale invariance and robustness remains a pressing challenge in object detection research.
To overcome these limitations, this paper proposes a novel BBR loss function named Gaussian Adaptive Ochiai Loss (GAOC). GAOC integrates the properties of the Ochiai Coefficient (OC) and a Gaussian Adaptive (GA) distribution. The OC ensures scale invariance through normalization by the square root of the product of the bounding box areas. Concurrently, the GA component models the top-left and bottom-right (TL/BR) corner coordinates with a Gaussian distribution, thereby reducing sensitivity to significant positional deviations. This combined approach enhances the loss function’s adaptability across object scales and improves its robustness and accuracy in complex detection scenarios. We integrate GAOC into YOLOv5 [4] and RT-DETR [5] and evaluate it on PASCAL VOC [6,7] and MS COCO 2017 [8]. The results demonstrate that GAOC consistently surpasses existing IoU-based loss functions across multiple evaluation metrics, confirming its strong generality and effectiveness.
The primary contributions of this study are as follows:
- We propose a novel BBR loss based on the Ochiai Coefficient (OC). Compared to traditional IoU loss, OC emphasizes the balance of bounding box dimensions while assigning greater weight to the intersection of the two boxes.
- We introduce a Gaussian adaptive (GA) distribution for BBR loss, which improves robustness to positional variations by modeling TL/BR distances as a two-dimensional Gaussian distribution and computing similarity through GA.
- We validate the effectiveness of GAOC on public datasets, where it outperforms other BBR losses across multiple benchmarks.
The remainder of this paper is organized as follows. Section 2 reviews representative IoU-family bounding box regression losses and discusses their limitations. Section 3 presents the proposed GAOC loss, including its formulation, optimization properties, and implementation details. Section 4 describes the experimental settings and reports comprehensive comparisons and ablation studies on benchmark datasets. Finally, Section 5 concludes this work and outlines future directions.
2. Related Work
2.1. Object Detection
Object detection is a fundamental task in computer vision, which involves identifying objects in images and accurately determining their locations and categories. R-CNN [9] was a pioneering approach in this field, introducing selective search to generate candidate object regions. These regions were then processed by Convolutional Neural Networks (CNNs) for feature extraction, followed by object classification using Support Vector Machines (SVMs).
Building on R-CNN, a series of improved variants have been developed, including Fast R-CNN [10], Faster R-CNN [11], and Mask R-CNN [12], each achieving significant gains in both efficiency and accuracy. In parallel, the SSD [13] and YOLO [4,14,15,16] families have become widely adopted due to their balance of speed and accuracy, enabling real-time object detection within an end-to-end framework. More recently, algorithms such as CornerNet [17], CenterNet [18], and FCOS [19] have emerged, advancing object detection through novel architectural designs and innovative methodologies.
The DETR [5] family of algorithms, distinguished by its attention mechanism and Transformer-based architecture, introduces a novel paradigm for object detection. Despite their diverse design principles and implementations, object detection algorithms share a common challenge in BBR, a critical step in accurate object localization. Precise bounding box prediction enables these models to localize objects reliably within images, thereby providing a robust foundation for subsequent analysis and processing.
2.2. Bounding Box Regression Losses
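Existing mainstream BBR losses are primarily built upon the IoU [3], defined as:
$$\mathrm{IoU} = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|}$$
where $B$ and $B^{gt}$ denote the predicted and ground truth boxes, respectively. The corresponding loss is defined as:
$$L_{IoU} = 1 - \mathrm{IoU}$$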
IoU-based losses have become the dominant methodology for BBR.
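As a minimal sketch of the IoU definition above, assuming axis-aligned boxes encoded as `(x1, y1, x2, y2)` corner tuples and the common `1 - IoU` loss form (the helper names `box_area`, `iou`, and `iou_loss` are illustrative and do not come from the paper):

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(pred, gt):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Width and height of the overlap, clamped to zero when disjoint.
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    union = box_area(pred) + box_area(gt) - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred, gt):
    """Assumed loss form: 1 - IoU."""
    return 1.0 - iou(pred, gt)


if __name__ == "__main__":
    print(iou_loss((0, 0, 2, 2), (1, 1, 3, 3)))  # partial overlap
    print(iou_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes
```

For disjoint boxes the loss stays at its maximum wherever the predicted box sits, which is the vanishing-gradient issue that the losses reviewed next try to address.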
The GIoU loss [20] mitigates the problem of gradient vanishing during bounding box updates when the predicted boxes and the ground truth are non-overlapping. The GIoU loss is defined as:
$$L_{GIoU} = 1 - \mathrm{IoU} + \frac{|C \setminus (B \cup B^{gt})|}{|C|}$$
Here $C$ represents the smallest convex hull that encloses both $B$ and $B^{gt}$.
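The GIoU penalty can be illustrated with a short sketch. It assumes `(x1, y1, x2, y2)` boxes and takes the enclosing region as the smallest axis-aligned box around both inputs, which is how GIoU is usually computed for horizontal rectangles; the function name `giou_loss` is illustrative.

```python
def giou_loss(pred, gt):
    """GIoU loss for two axis-aligned (x1, y1, x2, y2) boxes."""
    # Intersection area (zero when the boxes do not overlap).
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih

    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter

    # Smallest axis-aligned box enclosing both inputs (assumed form of C).
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c_area = cw * ch

    giou = inter / union - (c_area - union) / c_area
    return 1.0 - giou


if __name__ == "__main__":
    gt = (0, 0, 1, 1)
    # Both predictions are disjoint from gt, but the farther one costs more.
    print(giou_loss((2, 0, 3, 1), gt))
    print(giou_loss((5, 0, 6, 1), gt))
```

Unlike the plain IoU loss, the value keeps growing as a disjoint prediction moves away from the target, so the loss still carries a direction to move in.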
In contrast to GIoU, the DIoU [21] loss introduces an additional distance term into the IoU formulation, which minimizes the normalized distance between the centroids of the two bounding boxes, thereby enabling the algorithm to achieve faster convergence and improved performance. The DIoU loss is defined as:
$$L_{DIoU} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2}$$
where $b$ and $b^{gt}$ represent the centroids of $B$ and $B^{gt}$, respectively. The term $\rho(\cdot)$ represents the Euclidean distance, and $c$ denotes the diagonal length of the smallest enclosing bounding box.
The CIoU [21] loss extends DIoU by incorporating an additional shape loss term that accounts for aspect ratio consistency. The CIoU loss is formally defined as follows:
$$L_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$
where $\alpha$ is the trade-off parameter given by:
$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$$
and $v$ quantifies aspect ratio consistency:
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$
Here, $w^{gt}$ and $h^{gt}$ represent the width and height of the ground truth box, while $w$ and $h$ represent those of the predicted box. When the ground truth and predicted boxes share the same aspect ratio, CIoU simplifies to DIoU. The results of various losses are presented in Figure 1.
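The relationship between DIoU and CIoU can be checked numerically. The sketch below assumes `(x1, y1, x2, y2)` boxes with positive width and height and uses the standard published forms of both losses; the helper names are illustrative.

```python
import math


def _iou(pred, gt):
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_p + area_g - inter)


def diou_loss(pred, gt):
    """1 - IoU + squared center distance / squared enclosing diagonal."""
    pcx, pcy = (pred[0] + pred[2]) / 2.0, (pred[1] + pred[3]) / 2.0
    gcx, gcy = (gt[0] + gt[2]) / 2.0, (gt[1] + gt[3]) / 2.0
    rho2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    return 1.0 - _iou(pred, gt) + rho2 / (cw ** 2 + ch ** 2)


def ciou_loss(pred, gt):
    """DIoU plus the aspect-ratio term alpha * v."""
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = (4.0 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    denom = (1.0 - _iou(pred, gt)) + v
    # Guard the 0/0 case of two identical boxes.
    alpha = v / denom if denom > 0 else 0.0
    return diou_loss(pred, gt) + alpha * v


if __name__ == "__main__":
    same_ratio = ((0, 0, 2, 2), (1, 1, 3, 3))
    diff_ratio = ((0, 0, 4, 1), (0, 0, 2, 2))
    print(diou_loss(*same_ratio), ciou_loss(*same_ratio))
    print(diou_loss(*diff_ratio), ciou_loss(*diff_ratio))
```

With equal aspect ratios $v = 0$, so the two losses coincide; the aspect-ratio term only contributes when the shapes differ.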
The EIoU [22] loss directly minimizes the normalized differences in width, height, and centroid position between the predicted and ground truth boxes. The EIoU loss is defined as:
$$L_{EIoU} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{(w^{c})^2} + \frac{\rho^2(h, h^{gt})}{(h^{c})^2}$$
where $w^{c}$ and $h^{c}$ represent the width and height of the smallest enclosing bounding box that contains both the predicted and ground truth boxes.
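A compact sketch of the EIoU terms follows, assuming `(x1, y1, x2, y2)` boxes and the standard published form of the loss (center, width, and height penalties, each normalized by the enclosing box); the function name `eiou_loss` is illustrative.

```python
def eiou_loss(pred, gt):
    """EIoU loss for two axis-aligned (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    iou = inter / (w * h + wg * hg - inter)

    # Width and height of the smallest enclosing box.
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])

    # Squared distance between the two box centers.
    rho2 = ((pred[0] + pred[2]) / 2.0 - (gt[0] + gt[2]) / 2.0) ** 2 \
         + ((pred[1] + pred[3]) / 2.0 - (gt[1] + gt[3]) / 2.0) ** 2

    center_term = rho2 / (cw ** 2 + ch ** 2)
    width_term = (w - wg) ** 2 / cw ** 2
    height_term = (h - hg) ** 2 / ch ** 2
    return 1.0 - iou + center_term + width_term + height_term


if __name__ == "__main__":
    print(eiou_loss((0, 0, 4, 1), (0, 0, 2, 2)))
```

Because width and height are penalized separately, the loss stays informative even when two boxes happen to share an aspect ratio but differ in size.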
SIoU [23] extends IoU-based BBR by adding angle-aware guidance, together with distance and shape penalties, to stabilize optimization and improve localization. It is formulated as
$$L_{SIoU} = 1 - \mathrm{IoU} + \frac{\Delta + \Omega}{2}$$
where $\Delta$ and $\Omega$ represent the distance and shape costs, respectively, and the distance term is reweighted by an angle-dependent factor.
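When the predicted and ground truth boxes share the same aspect ratio but differ in width and height, many existing BBR losses can no longer be optimized effectively. To address this issue and leverage the geometric properties of horizontal rectangles, a bounding box similarity metric termed minimum-point-distance IoU (MPDIoU) was proposed [24] and is defined as follows:
$$\mathrm{MPDIoU} = \mathrm{IoU} - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2}$$
where $d_1$ and $d_2$ are the distances between the TL corners and between the BR corners of the two boxes, respectively, and $w$ and $h$ denote the width and height of the image.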
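The corner-distance idea behind MPDIoU can be sketched as below. The code assumes `(x1, y1, x2, y2)` boxes in pixel coordinates and the loss form `1 - MPDIoU`; the function names and the 10 by 10 image size in the demo are illustrative choices.

```python
def mpdiou(pred, gt, img_w, img_h):
    """IoU minus the normalized squared TL and BR corner distances."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter)

    d1_sq = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2  # top-left corners
    d2_sq = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2  # bottom-right corners
    norm = img_w ** 2 + img_h ** 2
    return iou - d1_sq / norm - d2_sq / norm


def mpdiou_loss(pred, gt, img_w, img_h):
    return 1.0 - mpdiou(pred, gt, img_w, img_h)


if __name__ == "__main__":
    print(mpdiou_loss((0, 0, 2, 2), (1, 1, 3, 3), img_w=10, img_h=10))
```

Note that the normalization depends on the image size rather than on the boxes themselves, which is the point raised later about MPDIoU not explicitly improving scale robustness.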
The WIoU [25] loss is based on a dynamic non-monotonic focusing mechanism. This mechanism utilizes the outlier degree to evaluate the quality of predicted boxes and designs a gradient gain allocation strategy. The strategy reduces the dominance of high-quality predicted boxes while mitigating the adverse gradients from low-quality examples, thereby enabling WIoU to focus on medium-quality predicted boxes and enhance the detector’s overall performance. WIoU is defined as follows:
$$L_{WIoU} = R_{WIoU}\, L_{IoU}, \qquad R_{WIoU} = \exp\left(\frac{(x - x^{gt})^2 + (y - y^{gt})^2}{W_g^2 + H_g^2}\right)$$
where $(x, y)$ and $(x^{gt}, y^{gt})$ are the centroids of the predicted and ground truth boxes, and $R_{WIoU}$ significantly amplifies the $L_{IoU}$ of ordinary-quality predicted boxes ($R_{WIoU} \in [1, e)$). When the predicted box and the ground truth overlap well, $L_{IoU}$ significantly reduces $R_{WIoU}$ for high-quality predicted boxes, with a focus on the centroid distance. Here, $W_g$ and $H_g$ denote the dimensions of the minimum enclosing box.
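The distance-attention factor described above can be sketched as follows. This covers only the basic form of WIoU (often called v1), with the factor computed from the center distance and the enclosing-box size; the dynamic non-monotonic focusing coefficient is omitted, and the function names are illustrative.

```python
import math


def wiou_factor(pred, gt):
    """exp(squared center distance / squared enclosing-box diagonal)."""
    dx = (pred[0] + pred[2]) / 2.0 - (gt[0] + gt[2]) / 2.0
    dy = (pred[1] + pred[3]) / 2.0 - (gt[1] + gt[3]) / 2.0
    wg = max(pred[2], gt[2]) - min(pred[0], gt[0])
    hg = max(pred[3], gt[3]) - min(pred[1], gt[1])
    return math.exp((dx ** 2 + dy ** 2) / (wg ** 2 + hg ** 2))


def wiou_v1_loss(pred, gt):
    """Basic WIoU: the IoU loss scaled by the distance factor."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou_loss = 1.0 - inter / (area_p + area_g - inter)
    return wiou_factor(pred, gt) * iou_loss


if __name__ == "__main__":
    print(wiou_factor((0, 0, 2, 2), (1, 1, 3, 3)))
    print(wiou_v1_loss((0, 0, 2, 2), (1, 1, 3, 3)))
```

In a training framework the enclosing-box denominator is typically treated as a constant during backpropagation; that detail has no counterpart in this scalar sketch.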
2.3. Limitations
Despite substantial progress in IoU-based BBR losses, practical limitations remain. GIoU, DIoU, and CIoU primarily improve the overlap, center-distance, and aspect-ratio constraints, but can be less effective under large displacement, extreme aspect ratios, or long-tail scales. EIoU and SIoU further refine geometric penalties, yet introduce additional design complexity and potential sensitivity to hyperparameters. MPDIoU emphasizes point-wise geometry but does not explicitly improve scale robustness, while WIoU mainly reweights samples without changing the underlying overlap similarity, leaving limited guidance when overlap is low. Motivated by these gaps, we propose GAOC, which enhances scale robustness via Ochiai-style normalization and strengthens geometric guidance through Gaussian-adaptive corner modeling for more stable regression under scale variation and large displacement.
3. Method
3.1. Simulation Experiment
This study employs a simulation experiment proposed in the CIoU [21] study to evaluate BBR performance under different loss conditions. The simulation generates seven target boxes (aspect ratios 1:4, 1:3, 1:2, 1:1, 2:1, 3:1, and 4:1; all with area ) centered at the coordinate . Twenty thousand anchor points are generated in a circular area of radius r, centered at the same location, and for each anchor point, 49 anchor boxes (aspect ratios , , , , , , ) are defined. Each anchor box is regressed toward the target box.
To compare convergence speeds at different stages, this study adopts the following experimental settings: for , anchor points are distributed both inside and outside the target box (Figure 2, left), representing the full range of BBR scenarios; for , anchor points lie within the target box (Figure 2, right), representing the primary BBR scenarios.
3.2. Ochiai Coefficient Loss
3.2.1. Loss Forward
To overcome the scale sensitivity of IoU-style normalization, this study proposes a novel Ochiai Coefficient (OC) loss, defined as follows. For each pixel $(i, j)$ in the image, the ground truth is represented as a four-dimensional vector $\tilde{x}_{i,j} = (\tilde{x}_t, \tilde{x}_b, \tilde{x}_l, \tilde{x}_r)$, where $\tilde{x}_t$, $\tilde{x}_b$, $\tilde{x}_l$, and $\tilde{x}_r$ represent the distances from pixel $(i, j)$ to the top, bottom, left, and right boundaries of the ground truth box, respectively. The predicted box is similarly defined as $x_{i,j} = (x_t, x_b, x_l, x_r)$, as illustrated in Figure 3.
For OC loss, given a predicted box $x$ and a ground truth $\tilde{x}$, the loss is defined as:
$$L_{OC} = -\ln(OC), \qquad OC = \frac{I}{\sqrt{X \tilde{X}}}$$
In Algorithm 1, $\tilde{x} \neq 0$ indicates whether pixel $(i, j)$ lies within a valid target box. $X$ and $\tilde{X}$ represent the areas of the predicted and ground truth boxes, respectively. $I_h$ and $I_w$ represent the height and width of the intersection region $I$. Since $0 \le OC \le 1$, we adopt the negative log-likelihood $L = -\ln(OC)$. This formulation yields a scale-invariant similarity that emphasizes the intersection while balancing box sizes, thereby facilitating more accurate bounding box prediction.
Moreover, this definition naturally normalizes OC to $[0, 1]$ independently of the bounding box scale. OC loss emphasizes the weight of shared elements between boxes by considering the balance between predicted box and ground truth sizes. Finally, OC accounts for set size by normalizing with the square root of the product of the two areas, which renders it invariant to set size.
Algorithm 1 OC loss Forward
- 1:Input: $\tilde{x}$ as ground truth, $x$ as predicted
- 2:Output: $L$ as localization error for each pixel
- 3:
- 4:
- 5:
- 6:
- 7:
- 8:
- 9:
- 10:
- 11:if not valid then,
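The forward pass can be sketched for a single pixel as follows. This is a reading of the surrounding description rather than the authors' code: it assumes each box is encoded as `(t, b, l, r)` distances from the pixel to the box sides, that a pixel is valid when its ground truth vector is non-zero, that invalid pixels contribute zero loss, and that the loss is the negative log of the Ochiai Coefficient (intersection over the square root of the product of areas). The function name is illustrative.

```python
import math


def oc_loss_forward(pred, gt):
    """Per-pixel OC loss for boxes given as (top, bottom, left, right) distances."""
    if not any(gt):
        # Assumption: pixels outside any valid target box contribute no loss.
        return 0.0

    xt, xb, xl, xr = pred
    gt_t, gt_b, gt_l, gt_r = gt

    area_pred = (xt + xb) * (xl + xr)        # X
    area_gt = (gt_t + gt_b) * (gt_l + gt_r)  # X~

    inter_h = min(xt, gt_t) + min(xb, gt_b)  # I_h
    inter_w = min(xl, gt_l) + min(xr, gt_r)  # I_w
    inter = inter_h * inter_w                # I

    oc = inter / math.sqrt(area_pred * area_gt)
    return -math.log(oc)


if __name__ == "__main__":
    small = oc_loss_forward((1, 1, 1, 1), (2, 2, 2, 2))
    large = oc_loss_forward((10, 10, 10, 10), (20, 20, 20, 20))
    print(small, large)  # the same value at both scales
```

In this example a 2 by 2 prediction sits inside a 4 by 4 target, giving OC = 4 / sqrt(4 * 16) = 0.5, whereas IoU for the same pair is 4 / 16 = 0.25. Scaling both boxes by the same factor leaves the value unchanged.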
3.2.2. Loss Backward
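For the OC loss backpropagation, with $L = -\ln\big(I / \sqrt{X \tilde{X}}\big)$, the partial derivative of $X$ with respect to $x$ (denoted $\nabla_x X$, where $x \in \{x_t, x_b, x_l, x_r\}$) is computed first:
$$\frac{\partial X}{\partial x_t} = \frac{\partial X}{\partial x_b} = x_l + x_r, \qquad \frac{\partial X}{\partial x_l} = \frac{\partial X}{\partial x_r} = x_t + x_b$$
Next, the partial derivative of $I$ with respect to $x$ (denoted $\nabla_x I$) is derived:
$$\frac{\partial I}{\partial x_t} = \begin{cases} I_w, & x_t < \tilde{x}_t \\ 0, & \text{otherwise} \end{cases} \qquad \frac{\partial I}{\partial x_l} = \begin{cases} I_h, & x_l < \tilde{x}_l \\ 0, & \text{otherwise} \end{cases}$$
with the analogous expressions for $x_b$ and $x_r$. Finally, the gradient of the OC localization loss with respect to $x$ is given by:
$$\frac{\partial L}{\partial x} = \frac{1}{2X} \nabla_x X - \frac{1}{I} \nabla_x I$$
The OC loss mechanism is best understood through its mathematical formulation: $\nabla_x X$ represents the penalty term associated with the predicted box, which is directly proportional to the loss gradient; $\nabla_x I$ represents the penalty term associated with the intersection region, which is inversely proportional to the loss gradient. Consequently, minimizing the OC loss requires maximizing the intersection region while simultaneously minimizing the predicted box area. The limiting case arises when the intersection region equals the predicted box, yielding perfect bounding box alignment.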
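The sign structure of this gradient can be checked against finite differences. The sketch below assumes the loss is the negative log of intersection over the square root of the product of areas, with boxes encoded as `(t, b, l, r)` distances; under that assumption the gradient is the box-area term divided by twice the predicted area, minus the intersection term divided by the intersection. All function names are illustrative.

```python
import math


def oc_loss(pred, gt):
    xt, xb, xl, xr = pred
    gt_t, gt_b, gt_l, gt_r = gt
    area_pred = (xt + xb) * (xl + xr)
    area_gt = (gt_t + gt_b) * (gt_l + gt_r)
    inter_h = min(xt, gt_t) + min(xb, gt_b)
    inter_w = min(xl, gt_l) + min(xr, gt_r)
    return -math.log(inter_h * inter_w / math.sqrt(area_pred * area_gt))


def oc_loss_grad(pred, gt):
    """Analytic gradient with respect to (x_t, x_b, x_l, x_r)."""
    xt, xb, xl, xr = pred
    gt_t, gt_b, gt_l, gt_r = gt
    area_pred = (xt + xb) * (xl + xr)
    inter_h = min(xt, gt_t) + min(xb, gt_b)
    inter_w = min(xl, gt_l) + min(xr, gt_r)
    inter = inter_h * inter_w

    # Gradient of the predicted area X.
    d_area = (xl + xr, xl + xr, xt + xb, xt + xb)
    # Gradient of the intersection I: non-zero only while the predicted
    # side is still inside the ground truth side.
    d_inter = (
        inter_w if xt < gt_t else 0.0,
        inter_w if xb < gt_b else 0.0,
        inter_h if xl < gt_l else 0.0,
        inter_h if xr < gt_r else 0.0,
    )
    return tuple(da / (2.0 * area_pred) - di / inter
                 for da, di in zip(d_area, d_inter))


def numeric_grad(pred, gt, eps=1e-6):
    """Central finite differences, for comparison only."""
    grads = []
    for k in range(4):
        hi, lo = list(pred), list(pred)
        hi[k] += eps
        lo[k] -= eps
        grads.append((oc_loss(hi, gt) - oc_loss(lo, gt)) / (2.0 * eps))
    return tuple(grads)


if __name__ == "__main__":
    pred, gt = (1.0, 1.5, 0.8, 2.5), (2.0, 1.0, 1.0, 2.0)
    print(oc_loss_grad(pred, gt))
    print(numeric_grad(pred, gt))
```

Sides that are still short of the target (here the top and left) receive a negative gradient, so a descent step extends them, while sides that overshoot (bottom and right) receive a positive gradient and shrink.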
3.3. Gaussian Adaptive Loss
During training, the predicted box in Algorithm 2 is optimized to minimise the loss with respect to the ground truth . Specifically, the TL and BR coordinates of the predicted box B are obtained through a transformation , while the ground truth is defined by the coordinates . Predicted boxes are clustered around the ground truth, with the distance from each predicted box to the ground truth modelled as a Gaussian distribution (GA), as illustrated in Figure 4.
Algorithm 2 GAOC as Bounding Box Loss
- 1:Input Predicted , ground truth , width and height of input image: .
- 2:Output
- 3:For the predicted box , ensuring and .
- 4:
- 5:
- 6:
- 7:
- 8:Calculating area of
- 9:Calculating area of
- 10:Calculating intersection I between and :
- 11:
- 12:
- 13:
- 14:
- 15:
- 16:
- 17:
Consider a two-dimensional Gaussian distribution for the top-left corner coordinates of the predicted and ground truth boxes, denoted , where the mean vector is The vectors are mutually orthogonal, thus the correlation coefficient is . Treating the x-axis and y-axis distances as equivalent, the variance matrix is given by:
The two-dimensional Gaussian distribution for distance is:
where is the squared Euclidean distance between the TL corners of the predicted and ground truth boxes.
Similarly, the squared Euclidean distance between the BR corners of the predicted and ground truth boxes is:
with the corresponding Gaussian distribution:
Here, the values of and are both set to 2, yielding the GA loss:
The backward of the GA term can be expressed as:
The GAOC loss is formulated as:
Thus, the loss is expressed as:
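The Gaussian corner modeling described in this section can be illustrated with a small sketch. Several parts are assumptions: the similarity is taken to be an unnormalized isotropic Gaussian kernel of the squared corner distance, coordinates are used as given without any image-size normalization, and the sketch stops at the per-corner similarity because the exact GA loss and its combination with OC are not reproduced here. Only the value sigma = 2 comes from the text; all names are illustrative.

```python
import math

SIGMA = 2.0  # the text sets both sigma values to 2


def corner_sq_dist(p, q):
    """Squared Euclidean distance between two (x, y) corners."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def gaussian_similarity(sq_dist, sigma=SIGMA):
    """Assumed form: exp(-d^2 / (2 sigma^2)), equal variance on both axes."""
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def corner_similarities(pred, gt, sigma=SIGMA):
    """TL and BR similarities for two (x1, y1, x2, y2) boxes."""
    d1_sq = corner_sq_dist(pred[:2], gt[:2])  # top-left corners
    d2_sq = corner_sq_dist(pred[2:], gt[2:])  # bottom-right corners
    return gaussian_similarity(d1_sq, sigma), gaussian_similarity(d2_sq, sigma)


if __name__ == "__main__":
    gt = (0, 0, 4, 4)
    print(corner_similarities((0, 0, 4, 4), gt))  # perfect match
    print(corner_similarities((2, 2, 4, 4), gt))  # TL corner shifted by (2, 2)
    print(corner_similarities((6, 6, 9, 9), gt))  # no overlap with gt
```

The last case shows why a corner-based term is useful next to an overlap measure: the similarity is small but still non-zero and still changes with distance when the boxes do not overlap at all.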
4. Experiments
The objective of this study is to enhance the ability of object detection algorithms to accurately localize objects of diverse sizes and shapes. To this end, we propose the Gaussian Adaptive Ochiai BBR Loss (GAOC), a novel loss function designed to optimize bounding box localisation in object detection. By integrating GAOC into the YOLOv5 and RT-DETR object detection frameworks and conducting extensive experiments on the PASCAL VOC and MS COCO 2017 benchmark datasets, we demonstrate that GAOC achieves superior performance in multi-scale object detection tasks.
4.1. MS COCO 2017 and PASCAL VOC
The Common Objects in Context (COCO) dataset is a large-scale benchmark developed by Microsoft in collaboration with research institutions. It contains over 330,000 images featuring diverse objects and complex backgrounds and is widely used for computer vision tasks such as object detection, semantic segmentation, and instance-level annotation. Each image is annotated with precise bounding boxes, pixel-level segmentation masks, and corresponding semantic labels. The COCO dataset defines 80 object categories, covering common classes such as humans, animals, and vehicles, while also providing scene-level annotations for background contexts. In the COCO 2017 release, the Train2017 subset (118,287 images) is used for model training, the Val2017 subset (5000 images) for validation, and the Test2017 subset (20,288 images) for performance evaluation and benchmarking.
The PASCAL Visual Object Classes (PASCAL VOC) dataset is a widely used benchmark for object detection, image classification, and semantic segmentation. In this study, the VOC2007 and VOC2012 datasets are integrated to form a combined training set of 21,503 images and a test set of 4952 images, covering a total of 20 object categories.
4.2. Experimental Setup
The experimental setup of this study is described as follows. Owing to their larger network architectures and higher parameter counts, YOLOv5X and RT-DETR were applied to complex scenarios in the COCO dataset, whereas YOLOv5L was selected for the PASCAL VOC dataset. Ablation experiments were conducted to assess the performance of RT-DETR on the PASCAL VOC dataset. Training employed stochastic gradient descent (SGD) as the optimizer, with a learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005. Data augmentation techniques included random flipping, rotation, translation, mosaic stitching, and image blending. The label smoothing factor was set to 0.2, and the input image size was fixed at 640 × 640 pixels.
4.3. Experimental Analysis
As shown in Figure 5a, under identical initialization conditions, both IoU and OC losses decrease monotonically as training iterations increase, and they reach a plateau after about 50 to 70 epochs, reflecting the stability of the overall optimization process. However, the two curves ultimately converge to loss levels of 0.804 (IoU) and 0.788 (OC), indicating that their localization performance remains limited. Of the two, the OC loss shows a faster descent and a lower convergence value, implying that in cases of minor overlaps or partial scale mismatches, OC provides smoother gradients and thus achieves more effective loss suppression. These results reveal that relying solely on intersection-based metrics (IoU or OC) is insufficient to provide adequate geometric guidance in scenarios with large displacements or significant shape differences.
Based on the results in Figure 5b,c and Table 1, we systematically compared multiple BBR loss functions. As shown in Figure 5b,c, all methods exhibit a stable downward trend in their training curves. However, they differ significantly in convergence speed and final residual values. In terms of curve morphology, WIoU, SIoU, and GAOC decline most rapidly within the 40 to 60 epochs and quickly enter a low loss plateau. GIoU converges the slowest and retains the highest residual error. EIoU descends faster in the mid-to-late training phase but still ends with relatively high residuals. A closer comparison between GAOC (GA + OC) and GAI (GA + IoU) shows that both follow a smooth, S-shaped convergence trajectory. GAOC achieves a lower final value and a more stable plateau. This suggests that it provides effective gradient guidance in both the early geometric alignment and the later fine-tuning stages of localization.
From Table 1, represents the overall average localization quality, whereas reflects the lower bound performance on the most challenging samples. Higher values indicate greater robustness. GAOC achieves the best results on both (0.963) and (0.956). This demonstrates improvements not only in overall accuracy but also in worst-case long tail performance. GAI (0.959/0.953) yields comparable results. WIoU achieves a higher (0.950) than MPDIoU (0.942), indicating stronger robustness. DIoU and CIoU perform at moderate levels. EIoU shows competitive average performance (0.958) but a very low (0.51), revealing high vulnerability to long tail cases. GIoU performs the weakest across both metrics (0.872/0.571).
Mechanistically, GAOC and GAI integrate overlap measures with Gaussian corner-based geometric constraints. These constraints yield nonvanishing, directionally informative gradients even under zero overlap and weak overlap conditions. This design enables fast convergence, low residual loss, and strong long tail robustness. WIoU mitigates noisy gradient effects through hard sample adaptive weighting, thereby stabilizing its tail performance. In contrast, EIoU suffers from gradient imbalance when dealing with extreme shapes and large displacements. This results in degraded tail performance. In summary, GAOC achieves the best overall performance in this simulation study, excelling in convergence speed, final accuracy, and long tail robustness.
Table 2 and Table 3 present the experimental results on the COCO 2017 and PASCAL VOC datasets, respectively. A comparison of different BBR losses (including CIoU, DIoU, EIoU, GIoU, SIoU, WIoU, and MPDIoU) against GAOC yields the following conclusions. On the COCO 2017 dataset, GAOC outperformed all competing methods in mAP, mAP75, and mAP50. Specifically, GAOC achieved an mAP of 46.2%, a 1.6 percentage-point improvement over the best-performing competitor, WIoU (44.6%), thereby demonstrating a clear advantage in detection accuracy. GAOC also achieved an mAP50 of 64.5%, corresponding to a 3.2% increase over WIoU and substantial improvements compared to other methods.
On the PASCAL VOC dataset, GAOC achieved an mAP50 of 79.0%, further confirming its superior performance. The consistent improvements across both datasets highlight GAOC’s robust generalization capability: it sustains high detection accuracy in the complex, real-world scenes of COCO 2017 as well as in the standardized image settings of PASCAL VOC. Figure 6 and Figure 7 illustrate qualitative detection results, further supporting the superior performance of GAOC compared with alternative loss functions.
Table 4 presents the performance of various BBR loss functions on the MS COCO dataset using RT-DETR as the baseline, with GAOC demonstrating clear advantages across key evaluation metrics. For the mAP50 metric, GAOC achieved the highest score of 65.3%, surpassing the second-ranked CIoU (64.9%) by 0.4 percentage points. This result indicates that GAOC delivers superior object detection accuracy under relaxed matching criteria. GAOC also maintained the leading position in mAP, achieving a score of 47.8%. Overall, in experiments on the MS COCO dataset with RT-DETR as the baseline, GAOC consistently outperformed other mainstream BBR loss functions (including CIoU, DIoU, and EIoU) across both mAP50 and mAP metrics. These findings confirm that GAOC more effectively optimizes BBR in object detection, improves detection accuracy, and provides superior overall performance.
The design philosophy of GAOC is grounded in the intrinsic characteristics of the BBR problem. By introducing the Ochiai Coefficient, GAOC assigns greater weight to the intersection between predicted and ground truth boxes in the loss function, thereby improving detection performance for multi-scale objects. In addition, the incorporation of a Gaussian adaptive mechanism increases the algorithm’s robustness to variations in target position by modeling the TL/BR coordinates of bounding boxes as a two-dimensional Gaussian distribution.
Synthesizing the experimental results with the methodological analysis, we conclude that the GAOC loss function demonstrates outstanding performance in object detection tasks. The combination of its innovative design and consistent empirical outcomes validates GAOC as a novel BBR loss function with significant potential to advance the field of object detection. Future research could explore applying GAOC to diverse scenarios and tasks, as well as integrating it with other state-of-the-art (SOTA) algorithms.
4.4. Ablation Study
As shown in Table 5, OC consistently outperforms IoU on both mAP50 and mAP. By normalizing the intersection by the square root of the product of the two bounding box areas, OC achieves scale invariance and enables more accurate similarity measurement between predicted and ground truth boxes, thereby delivering superior performance in object detection tasks.
The comparison between GAI and GAOC shows that GAOC surpasses GAI by 1.2% and 0.5% in mAP50 and mAP, respectively. This result validates the rationale for integrating GA with OC. The GA mechanism reduces sensitivity to positional deviations by modeling the TL/BR coordinates of bounding boxes as a two-dimensional Gaussian distribution. The scale invariance of OC further complements this approach. In contrast, GAI’s reliance on IoU limits its optimization capacity, and even when combined with GA, it fails to match the comprehensive performance of GAOC. These findings suggest that GAOC adopts a more effective design strategy for BBR optimization, better addressing the challenges of complex scenes and multi-scale object detection.
5. Conclusions
This study proposed the GAOC loss function. Experimental results on the COCO 2017 and PASCAL VOC benchmark datasets demonstrated that GAOC delivers significant performance improvements in object detection tasks. GAOC enhances the detector’s capability for multi-scale object detection and reduces sensitivity to positional bias. It achieves this by assigning greater weight to the intersection of predicted and ground truth boxes and implementing point-to-point coordinate alignment. These experiments not only validate the effectiveness of GAOC but also highlight its strong potential for practical applications. Future work may extend GAOC to related domains, such as instance segmentation.
References
- [1] Tang Q., Su C., Tian Y., Zhao S., Yang K., Hao W., Feng X., Xie M. YOLO-SS: Optimizing YOLO for enhanced small object detection in remote sensing imagery. J. Supercomput. 2025, 81, 303. doi:10.1007/s11227-024-06765-8
- [2] Xie M., Tang Q., Tian Y., Feng X., Shi H., Hao W. DCN-YOLO: A Small-Object Detection Paradigm for Remote Sensing Imagery Leveraging Dilated Convolutional Networks. Sensors 2025, 25, 2241. doi:10.3390/s25072241. PMID 40218753. PMC11991083
- [3] Yu J., Jiang Y., Wang Z., Cao Z., Huang T. Unitbox: An advanced object detection network. Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520
- [4] Jocher G. Ultralytics YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 4 January 2026)
- [5] Zhao Y., Lv W., Xu S., Wei J., Wang G., Dang Q., Liu Y., Chen J. Detrs beat yolos on real-time object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974
- [6] Everingham M., Eslami S.A., Van Gool L., Williams C.K., Winn J., Zisserman A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. doi:10.1007/s11263-014-0733-5
- [7] Everingham M., Gool L.V., Williams C.K.I., Winn J., Zisserman A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. doi:10.1007/s11263-009-0275-4
- [8] Lin T.Y., Maire M., Belongie S., Hays J., Perona P., Ramanan D., Dollár P., Zitnick C.L. Microsoft coco: Common objects in context. Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13; Springer: Cham, Switzerland, 2014; pp. 740–755
