P$^2$HCT: Plug-and-Play Hierarchical C2F Transformer for Multi-Scale Feature Fusion
Junyi Hu, Tian Bai, Fengyi Wu, Zhenming Peng, Yi Zhang

TL;DR
P$^2$HCT is a lightweight, plug-and-play hierarchical transformer module that makes multi-scale feature fusion in vision models more efficient and more accurate, which suits it to resource-constrained environments.
Contribution
The paper introduces P$^2$HCT, a novel hierarchical transformer module that reduces computational overhead while enhancing feature fusion for detection and classification tasks.
Findings
Integrated into YOLOv11-N/S/M, P$^2$HCT improves mAP on MS COCO by 0.9%, 0.5%, and 0.4%, respectively.
Embedding P$^2$HCT into ResNet-18/50/101 backbones improves ImageNet top-1 accuracy.
P$^2$HCT achieves these gains with minimal latency increase.
Abstract
Feature fusion plays a pivotal role in achieving high performance in vision models, yet existing attention-based fusion techniques often suffer from substantial computational overhead and implementation complexity, particularly in resource-constrained settings. To address these limitations, we introduce the Plug-and-Play Hierarchical C2F Transformer (P$^2$HCT), a lightweight module that combines coarse-to-fine token selection with shared attention parameters to preserve spatial details while reducing inference cost. P$^2$HCT is trainable using coarse attention alone and can be seamlessly activated at inference to enhance accuracy without retraining. Integrated into real-time detectors such as YOLOv11-N/S/M, P$^2$HCT achieves mAP gains of 0.9%, 0.5%, and 0.4% on MS COCO with minimal latency increase. Similarly, embedding P$^2$HCT into ResNet-18/50/101 backbones improves ImageNet top-1…
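To make the coarse-to-fine idea in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how coarse tokens are formed or how fine tokens are selected, so the sketch assumes average pooling over square windows, top-k window selection by coarse attention score, and single-head attention. The function name `c2f_attention` and the parameters `window`, `top_k`, and `use_fine` are hypothetical. The one property taken from the abstract is that the coarse and fine stages share the same projection weights, so the fine stage can be switched on without adding parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def c2f_attention(feat, Wq, Wk, Wv, window=2, top_k=2, use_fine=True):
    """Illustrative coarse-to-fine attention (assumed design, see lead-in).

    feat: (H, W, C) feature map, with H and W divisible by `window`.
    Wq, Wk, Wv: (C, d) projections shared by the coarse and fine stages.
    Returns an (H*W, d) array, one output row per fine token.
    """
    H, W, C = feat.shape
    hc, wc = H // window, W // window
    # Group fine tokens by spatial window: (M, window*window, C).
    groups = (feat.reshape(hc, window, wc, window, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(hc * wc, window * window, C))
    coarse = groups.mean(axis=1)            # assumed pooling: window average
    q = feat.reshape(H * W, C) @ Wq         # every fine token is a query
    scale = 1.0 / np.sqrt(q.shape[-1])

    # Coarse stage: each query attends to the pooled window tokens.
    kc, vc = coarse @ Wk, coarse @ Wv
    scores = (q @ kc.T) * scale             # (N, M)
    if not use_fine:
        return softmax(scores) @ vc

    # Fine stage: keep the top_k windows per query, then attend to the
    # fine tokens inside them using the same Wk and Wv.
    top = np.argsort(-scores, axis=1)[:, :top_k]               # (N, top_k)
    sel = groups[top].reshape(H * W, top_k * window * window, C)
    kf, vf = sel @ Wk, sel @ Wv                                # (N, T, d)
    fine_scores = np.einsum("nd,ntd->nt", q, kf) * scale
    return np.einsum("nt,ntd->nd", softmax(fine_scores), vf)

# Usage: the same weights serve both modes.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
coarse_only = c2f_attention(feat, Wq, Wk, Wv, use_fine=False)
with_fine = c2f_attention(feat, Wq, Wk, Wv, use_fine=True)
```

In this sketch, setting `use_fine=False` mirrors the coarse-only training mode the abstract describes, and `use_fine=True` mirrors activating the fine stage at inference. When `top_k` equals the number of windows, the fine stage reduces to ordinary full attention over all tokens, which is a convenient sanity check.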
