RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder

Cheng Chi; Fangyun Wei; Han Hu

arXiv:2010.15831 · cs.CV · October 30, 2020 · 36 cites

TL;DR

RelationNet++ introduces a novel attention-based decoder module that effectively integrates multiple object representation formats into standard detectors, significantly enhancing detection accuracy on COCO benchmarks.

Contribution

The paper proposes the bridging visual representations (BVR) module, enabling end-to-end integration of heterogeneous object representations into existing detectors, with novel efficient computation techniques.

Findings

01. Achieves 1.5 to 3.0 AP improvements across various detectors.
02. Reaches 52.7 AP on COCO test-dev with RelationNet++.
03. Demonstrates broad effectiveness in integrating multiple representations.

Abstract

Existing object detection frameworks are usually built on a single format of object/part representation, i.e., anchor/proposal rectangle boxes in RetinaNet and Faster R-CNN, center points in FCOS and RepPoints, and corner points in CornerNet. While these different representations usually drive the frameworks to perform well in different aspects, e.g., better classification or finer localization, it is in general difficult to combine them in a single framework and make good use of each one's strengths, because the different representations extract features in heterogeneous or non-grid ways. This paper presents an attention-based decoder module, similar to that in the Transformer (Vaswani et al., 2017), to bridge other representations into a typical object detector built on a single representation format, in an end-to-end fashion. The other representations act as a set of key…
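The query/key arrangement described in the abstract can be illustrated with plain scaled dot-product cross-attention. The following is a minimal sketch, not the paper's BVR module: it assumes the detector's main representation supplies the queries, a second representation supplies the keys and values, and the attended result is added back through a residual connection. The function names, feature sizes, and random inputs are hypothetical, and the sketch omits whatever the paper adds on top, such as the efficient computation techniques mentioned under Contribution.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention with a residual connection.

    queries: (n_q, d) features of the main representation (assumed).
    keys:    (n_k, d) features of the auxiliary representation (assumed).
    values:  (n_k, d) features aggregated according to the attention weights.
    Returns the enhanced query features and the attention weights.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)       # each query's weights sum to 1
    attended = weights @ values              # (n_q, d) aggregated key features
    return queries + attended, weights       # residual: keep the original query


# Hypothetical toy inputs: 4 main-representation features, 6 auxiliary ones.
rng = np.random.default_rng(0)
main_feats = rng.normal(size=(4, 8))
aux_feats = rng.normal(size=(6, 8))

enhanced, weights = cross_attention(main_feats, aux_feats, aux_feats)
print(enhanced.shape, weights.shape)
```

Each row of `weights` describes how strongly one query feature draws on each auxiliary feature, and the residual connection means the main representation is enhanced rather than replaced.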

Code & Models

Videos

RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder · slideslive

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: RepPoints · 1x1 Convolution · Residual Connection · Corner Pooling · Max Pooling · Softmax · Feature Pyramid Network · Convolution · Region Proposal Network