P$^2$HCT: Plug-and-Play Hierarchical C2F Transformer for Multi-Scale Feature Fusion

Junyi Hu; Tian Bai; Fengyi Wu; Zhenming Peng; Yi Zhang

arXiv:2505.12772·cs.CV·March 31, 2026·2 cites

TL;DR

P$^2$HCT is a lightweight, plug-and-play hierarchical transformer module that makes multi-scale feature fusion in vision models both cheaper and more accurate, which suits it to resource-constrained environments.

Contribution

The paper introduces P$^2$HCT, a hierarchical transformer module that combines coarse-to-fine token selection with shared attention parameters, reducing computational overhead while improving feature fusion for detection and classification tasks.

Findings

01

Integrated into YOLOv11-N/S/M, P$^2$HCT improves mAP on MS COCO by 0.9%, 0.5%, and 0.4%, respectively.

02

Embedding P$^2$HCT into ResNet-18/50/101 backbones improves ImageNet top-1 accuracy.

03

P$^2$HCT achieves these gains with minimal latency increase.

Abstract

Feature fusion plays a pivotal role in achieving high performance in vision models, yet existing attention-based fusion techniques often suffer from substantial computational overhead and implementation complexity, particularly in resource-constrained settings. To address these limitations, we introduce the Plug-and-Play Hierarchical C2F Transformer (P$^2$HCT), a lightweight module that combines coarse-to-fine token selection with shared attention parameters to preserve spatial details while reducing inference cost. P$^2$HCT is trainable using coarse attention alone and can be seamlessly activated at inference to enhance accuracy without retraining. Integrated into real-time detectors such as YOLOv11-N/S/M, P$^2$HCT achieves mAP gains of 0.9%, 0.5%, and 0.4% on MS COCO with minimal latency increase. Similarly, embedding P$^2$HCT into ResNet-18/50/101 backbones improves ImageNet top-1…
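
The coarse-to-fine idea in the abstract can be illustrated with a small sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration and not the authors' method: tokens form a 1-D sequence, the coarse level is built by mean pooling over fixed windows, the fine stage attends only to the tokens inside the `top_k` highest-scoring windows, and a single set of projection matrices (`Wq`, `Wk`, `Wv`) is reused at both levels. The function names `mean_pool`, `attend`, and `c2f_attention` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(tokens, window):
    """Assumed coarsening step: average each run of `window` tokens."""
    pooled = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return pooled

def attend(query, tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product attention of one query over `tokens`."""
    q = matvec(Wq, query)
    keys = [matvec(Wk, t) for t in tokens]
    values = [matvec(Wv, t) for t in tokens]
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, k) / scale for k in keys])
    out = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return out, weights

def c2f_attention(query, tokens, Wq, Wk, Wv, window=2, top_k=1, fine=False):
    """Coarse attention, optionally refined over the top-k coarse windows.

    The same Wq/Wk/Wv are used at both levels (the shared-parameter
    assumption), so the fine stage adds no new weights and can be
    switched on after training with fine=False.
    """
    coarse_tokens = mean_pool(tokens, window)
    coarse_out, coarse_w = attend(query, coarse_tokens, Wq, Wk, Wv)
    if not fine:
        return coarse_out
    # Select the windows that received the most coarse attention.
    ranked = sorted(range(len(coarse_w)), key=lambda i: coarse_w[i], reverse=True)
    selected = sorted(ranked[:top_k])
    # Attend only to the fine tokens inside the selected windows.
    fine_tokens = [t for i in selected
                   for t in tokens[i * window:(i + 1) * window]]
    fine_out, _ = attend(query, fine_tokens, Wq, Wk, Wv)
    return fine_out
```

Calling `c2f_attention(..., fine=False)` corresponds to the coarse-only mode the abstract says is sufficient for training. Passing `fine=True` with the same weights mirrors the claim that the fine stage can be activated at inference without retraining. In this sketch the fine stage costs attention over only `top_k * window` tokens, not the full sequence.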
