CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector
Weiqiang Jin, Hang Yu, Hang Yu

TL;DR
This paper introduces CvT-ASSD, an object detection model that combines a Convolutional vision Transformer (CvT) with an attentive single shot multibox detector, aiming for competitive accuracy at lower computational cost on PASCAL VOC and MS COCO.
Contribution
The paper proposes CvT-ASSD, integrating convolutional vision transformers with an attentive single shot detector to improve detection accuracy and computational efficiency.
Findings
Achieves competitive detection performance on PASCAL VOC and MS COCO datasets.
Reduces computational complexity compared to traditional transformer-based detectors.
Demonstrates effective balance between accuracy and efficiency.
Abstract
Following the success of Bidirectional Encoder Representations from Transformers (BERT) in natural language processing (NLP), the multi-head attention transformer has become increasingly prevalent in computer vision (CV) research. However, applying it to complex tasks such as vision detection and semantic segmentation remains a challenge. Although several Transformer-based architectures, such as DETR and ViT-FRCNN, have been proposed for the object detection task, they lose discrimination accuracy and computational efficiency because of the enormous number of learnable parameters and the heavy computational complexity of the traditional self-attention operation. To alleviate these issues, we present a novel object detection architecture, named Convolutional vision Transformer Based Attentive Single Shot MultiBox Detector (CvT-ASSD), that…
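The abstract's point about the heavy cost of traditional self-attention can be illustrated with a rough operation count. The sketch below is not taken from the paper: the function name `attention_cost`, the patch size of 16, and the embedding width of 768 are illustrative assumptions (ViT-Base-like values), and the count covers only the score matrix and the weighted sum over values for a single layer.

```python
def attention_cost(height, width, patch, dim):
    """Rough multiply-add count for one full self-attention layer.

    Hypothetical helper for illustration only; patch and dim are assumed
    values, not settings reported for CvT-ASSD.
    """
    # Number of tokens when the image is split into non-overlapping patches.
    tokens = (height // patch) * (width // patch)
    # Q @ K^T yields a tokens x tokens matrix, each entry a dim-length dot
    # product; applying those weights to V costs the same again.
    return tokens, 2 * tokens * tokens * dim


if __name__ == "__main__":
    for side in (224, 448, 896):
        n, cost = attention_cost(side, side, patch=16, dim=768)
        print(f"{side}x{side}: {n} tokens, ~{cost:,} multiply-adds")
```

Under these assumptions, doubling the image side quadruples the token count and multiplies the attention cost by sixteen, which is why detectors that operate on high-resolution inputs tend to look for cheaper alternatives to full self-attention.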
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Convolution · Label Smoothing · Adam · Dropout
