Cross Resolution Encoding-Decoding For Detection Transformers
Ashish Kumar, Jaesik Park

TL;DR
This paper introduces CRED, a novel mechanism enabling DETR-based object detection models to achieve high accuracy at reduced computational cost by efficiently fusing multiscale features through cross-resolution attention modules.
Contribution
The paper proposes the Cross-Resolution Encoding-Decoding (CRED) mechanism with CRAM and OSMA modules, allowing DETR to perform multiscale detection efficiently with fewer FLOPs and higher speed.
Findings
CRED achieves high-resolution accuracy with 50% fewer FLOPs.
CRED-DETR is 76% faster than high-resolution DETR on MS-COCO.
CRED reduces FLOPs by approximately 50% while maintaining accuracy.
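One way to see why encoding at low resolution saves compute: the attention-matrix cost of self-attention grows quadratically with the number of tokens. The back-of-the-envelope sketch below assumes an 800×1216 input, backbone strides of 16 and 32, and a hidden size of 256. These values and the helper names are illustrative assumptions, not figures from the paper, and the sketch does not attempt to reproduce the reported 50% reduction, which covers the whole model.

```python
def num_tokens(height, width, stride):
    """Number of feature-map positions (tokens) at a given backbone stride."""
    return (height // stride) * (width // stride)


def attention_matrix_flops(n_tokens, dim):
    """Rough multiply-add count for Q @ K^T plus attn @ V in one self-attention layer.

    Projections and feed-forward layers are ignored; they scale linearly
    with the token count rather than quadratically.
    """
    return 2 * n_tokens * n_tokens * dim


# Assumed (hypothetical) input size and hidden dimension.
H, W, D = 800, 1216, 256

n_high = num_tokens(H, W, 16)  # higher-resolution feature map
n_low = num_tokens(H, W, 32)   # lower-resolution feature map

ratio = attention_matrix_flops(n_high, D) / attention_matrix_flops(n_low, D)
print(n_high, n_low, ratio)
```

Halving the spatial resolution cuts the token count by 4× and this quadratic term by 16×, which is why running the encoder on the low-resolution feature is attractive.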
Abstract
Detection Transformers (DETR) are renowned object detection pipelines; however, computationally efficient multiscale detection using DETR is still challenging. In this paper, we propose a Cross-Resolution Encoding-Decoding (CRED) mechanism that allows DETR to achieve the accuracy of high-resolution detection while having the speed of low-resolution detection. CRED is based on two modules: the Cross Resolution Attention Module (CRAM) and One Step Multiscale Attention (OSMA). CRAM is designed to transfer the knowledge of the low-resolution encoder output to a high-resolution feature, while OSMA is designed to fuse multiscale features in a single step and produce a feature map of a desired resolution enriched with multiscale information. When used in prominent DETR methods, CRED delivers accuracy similar to the high-resolution DETR counterpart with roughly 50% fewer FLOPs. Specifically,…
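The abstract does not spell out the internals of CRAM. As a rough illustration of the general idea only (high-resolution tokens attending to the low-resolution encoder output), here is a minimal single-head cross-attention sketch in plain Python. The function names, the absence of learned projections, and the residual addition are assumptions made for illustration; none of them is taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_resolution_attention(high_res, low_res):
    """Hypothetical sketch: high-res tokens (queries) attend to low-res tokens (keys/values).

    high_res: list of N_h token vectors of length d
    low_res:  list of N_l token vectors of length d
    Returns N_h vectors: each high-res token plus its attention-weighted
    mix of low-res tokens (residual addition is an assumption).
    """
    d = len(high_res[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in high_res:
        # Scaled dot-product score of this query against every low-res token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in low_res]
        weights = softmax(scores)
        # Weighted sum of low-res tokens, one output channel at a time.
        mixed = [sum(w * v[j] for w, v in zip(weights, low_res)) for j in range(d)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


# Toy data: 4 high-resolution tokens, 2 low-resolution tokens, d = 2.
high = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
low = [[1.0, 0.0], [0.0, 1.0]]
enriched = cross_resolution_attention(high, low)
```

The output keeps the high-resolution token count, while the attention matrix has only N_h × N_l entries instead of N_h × N_h, which is the kind of saving a cross-resolution design aims for.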
Peer Reviews
Decision · Submitted to ICLR 2025
The proposed approach feeds the encoder with low-resolution features while supplying the decoder with high-resolution features from the backbone. This method achieves an effective speed-accuracy tradeoff. The results are demonstrated on the MS-COCO dataset.
The idea of combining low-resolution and high-resolution features or using multiscale features to enhance DETR performance is not new, as similar approaches have been used in previous studies, such as Zhang et al. (2023a), Zhao et al. (2024b), and Li et al. (2023). How do you justify the novelty of your contributions? The proposed approach has only been evaluated on a single dataset, MS-COCO. While it outperforms the baselines, as shown in Table 1, it achieves only competitive results compa
1. CRED improves the computational efficiency of DETR by allowing for the accuracy of high-resolution detection while operating at the speed of low-resolution processing. 2. CRAM and OSMA are the two main modules of CRED and enhance the DETR block: CRAM facilitates information transfer between high and low resolution, and OSMA simplifies the integration of detailed features across different scales. 3. The ablation experiments are comprehensive and adequate.
1. Evaluation on more datasets would be better; results on MS-COCO 2017 alone did not quite convince me. 2. While FLOPs and FPS metrics are provided, a more in-depth discussion of the tradeoffs between accuracy, computational cost, and inference speed would be very beneficial. 3. Why is augmenting global information to high-resolution features better than sparse sampling methods (e.g., IFMA)? 4. The main experiments have used DETR as a baseline, and I would like to know how the method performs on other more advanced dete
1. The CRED mechanism is a sound approach to reduce the computational cost of DETR models by balancing low-resolution encoding with high-resolution decoding. 2. Extensive experiments are conducted with the CRED-enhanced DETR models on the MS-COCO 2017 benchmark. The results show large improvements in FLOPs and runtime without loss in detection accuracy. 3. The paper provides detailed ablation studies, analyzing the contributions of the individual CRAM and OSMA modules. This shows the influence o
1. The overall design is more like an engineering design and lacks technical novelty. While the empirical performance improvements are well-documented, the paper could benefit from a more in-depth theoretical analysis of the CRED mechanism. For example, why does transferring low-resolution encoder information to high-resolution decoder inputs improve performance? 2. While the authors claim that CRED improves the detection of small objects, the improvements in average precision for small objects
Taxonomy
Topics: Neural Networks and Applications · Blind Source Separation Techniques · Industrial Vision Systems and Defect Detection
Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Feedforward Network · Convolution · Dropout
