Rethinking Training from Scratch for Object Detection

Yang Li; Hong Zhang; Yu Zhang

arXiv:2106.03112 · cs.CV · June 8, 2021 · 6 citations

TL;DR

This paper introduces a direct pre-training method for object detection that uses low-resolution images on the target dataset, significantly speeding up training and improving accuracy without relying on ImageNet pre-training.

Contribution

It proposes a novel direct detection pre-training pipeline that utilizes low-resolution images, enabling faster training and better performance, applicable to both CNN and transformer backbones.

Findings

01 Pre-training on COCO is more than 11x faster.

02 Achieves +1.8 mAP over ImageNet pre-training.

03 Applicable to transformer-based models such as Swin Transformer.

Abstract

ImageNet pre-training initialization is the de facto standard for object detection. He et al. found that it is possible to train a detector from scratch (random initialization), although this requires a longer training schedule and a proper normalization technique. In this paper, we explore direct pre-training on the target dataset for object detection. In this setting, we discover that the widely adopted large resizing strategy, e.g. resizing images to (1333, 800), is important for fine-tuning but not necessary for pre-training. Specifically, we propose a new training pipeline for object detection that follows the 'pre-training and fine-tuning' paradigm: low-resolution images from the target dataset are used to pre-train the detector, which is then loaded and fine-tuned with high-resolution images. With this strategy, we can use batch normalization (BN) with a large batch size during pre-training; it's also memory…
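
The resolution difference between the two stages can be made concrete with a small sketch. It assumes the common keep-aspect-ratio reading of a bound such as (1333, 800): long edge at most 1333, short edge at most 800. The function names, the 640x480 input size, and the low-resolution bound (667, 400) are hypothetical choices for illustration and are not taken from the paper.

```python
def keep_ratio_size(width, height, max_long, max_short):
    """Return the largest (width, height) that preserves the aspect ratio while
    keeping the long edge <= max_long and the short edge <= max_short."""
    scale = min(max_long / max(width, height), max_short / min(width, height))
    return round(width * scale), round(height * scale)


# Resize bounds per stage. (1333, 800) is the setting named in the abstract;
# (667, 400) is a hypothetical low-resolution choice for illustration only.
STAGE_BOUNDS = {"pretrain": (667, 400), "finetune": (1333, 800)}


def pixels_per_image(width, height, stage):
    """Pixel count of one image after resizing for the given stage."""
    new_w, new_h = keep_ratio_size(width, height, *STAGE_BOUNDS[stage])
    return new_w * new_h


if __name__ == "__main__":
    w, h = 640, 480  # example input image size (assumed)
    low = pixels_per_image(w, h, "pretrain")
    high = pixels_per_image(w, h, "finetune")
    print(f"pretrain: {low} px, finetune: {high} px, ratio: {high / low:.2f}")
```

Under these assumed bounds, halving both edges leaves about a quarter of the pixels per image. That is one plausible reason a larger batch could fit in the same memory during pre-training; the actual savings depend on the detector and are not derived here.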

Code & Models

Repositories

wxzs5/direct-pretraining (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Stochastic Depth · Swin Transformer · Byte Pair Encoding · Adam · Label Smoothing