Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection
Zhe Chen, Jing Zhang, Yufei Xu, Dacheng Tao

TL;DR
This paper introduces a lightweight Transformer-based context condensation module that enhances feature pyramid fusion in object detection, improving accuracy and reducing computational costs across multiple detectors.
Contribution
It proposes a novel context modeling mechanism with local and global representations, integrated with a Transformer decoder, to boost feature fusion efficiency and effectiveness.
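A rough sketch of the idea, under stated assumptions: the summary above does not specify how the local and global representations are computed, so this example stands in average pooling over non-overlapping windows for the local tokens and a mean over the whole map for the global token. It then applies a single-head cross-attention step with no learned projections, which is only one ingredient of a Transformer decoder. All function names (`condense_contexts`, `cross_attend`, `refine`) and the `window` parameter are hypothetical and not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def condense_contexts(fmap, window=2):
    """Condense an H x W grid of C-dim features into a short token list.

    ASSUMPTION: local tokens are window averages and the global token is
    the mean of the whole map; the paper's actual operators may differ.
    """
    h, w = len(fmap), len(fmap[0])
    local_tokens = []
    for i in range(0, h, window):
        for j in range(0, w, window):
            patch = [fmap[y][x]
                     for y in range(i, min(i + window, h))
                     for x in range(j, min(j + window, w))]
            local_tokens.append(mean_vec(patch))
    global_token = mean_vec([v for row in fmap for v in row])
    return local_tokens + [global_token]


def cross_attend(query, tokens):
    """Scaled dot-product attention of one feature over the context tokens,
    followed by a residual add (no learned weights in this sketch)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, t)) / math.sqrt(d)
              for t in tokens]
    weights = softmax(scores)
    attended = [sum(wt * t[c] for wt, t in zip(weights, tokens))
                for c in range(d)]
    return [q + a for q, a in zip(query, attended)]


def refine(fmap, window=2):
    """Refine every position using the condensed context tokens."""
    tokens = condense_contexts(fmap, window)
    return [[cross_attend(v, tokens) for v in row] for row in fmap]


# Usage: a 4 x 4 map of 2-dim features condenses to 4 local + 1 global token.
fmap = [[[float(y), float(x)] for x in range(4)] for y in range(4)]
tokens = condense_contexts(fmap, window=2)
print(len(tokens))  # 5
refined = refine(fmap, window=2)
```

The point of the sketch is the cost structure: each of the 16 positions attends to 5 condensed tokens rather than to all 16 positions, which is the kind of saving that condensing contexts is meant to provide.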
Findings
Improves detection accuracy by up to 7.8% AP on MS COCO
Reduces computational complexity by around 20% in GFLOPs
Compatible with multiple feature pyramid methods
Abstract
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF), which aims to mitigate the gap between features from different levels and form a comprehensive object representation to achieve better detection performance. However, they usually require heavy cross-level connections or iterative refinement to obtain better MFF results, making them complicated in structure and inefficient in computation. To address these issues, we propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results while effectively reducing computational costs. In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency. The two representations include a locally concentrated representation and a globally summarized…
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Adam
