KAN-FPN-Stem: A KAN-Enhanced Feature Pyramid Stem for Boosting ViT-based Pose Estimation
HaoNan Tang

TL;DR
This paper introduces a KAN-enhanced FPN-Stem that improves ViT-based pose estimation by replacing the standard post-fusion smoothing convolution in the FPN with a KAN-based layer, yielding gains of up to +2.0 AP on COCO.
Contribution
The work shows that the main bottleneck in the ViT front end lies in feature fusion, specifically the post-fusion smoothing step of the FPN, rather than in attention modules, and proposes a novel KAN-based convolutional layer to address it.
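The summary does not spell out how the KAN-based convolutional layer is parameterized, so the following is only a minimal sketch of the general idea behind a KAN-style convolution, under stated assumptions. Where a standard 3x3 convolution multiplies each input in the window by a scalar weight and sums, a KAN-style layer applies a learnable univariate function to each tap and sums the results. The names `KANEdge` and `kan_conv3x3`, the single-channel setting, and the SiLU-plus-Gaussian-RBF form of each function (a lightweight stand-in for the B-spline basis of the original KAN formulation) are hypothetical choices, not the paper's implementation.

```python
import math
from typing import List

Grid = List[List[float]]  # single-channel H x W feature map (toy setting)


def silu(x: float) -> float:
    """SiLU activation x * sigmoid(x), the fixed 'base' branch of each edge."""
    return x / (1.0 + math.exp(-x))


class KANEdge:
    """Hypothetical learnable univariate function on one kernel tap:
    phi(x) = w_base * silu(x) + sum_k c_k * exp(-((x - g_k) / width)^2).
    Gaussian RBFs are an assumed stand-in for the B-spline basis of KAN."""

    def __init__(self, w_base: float, coeffs: List[float],
                 centers: List[float], width: float = 1.0) -> None:
        assert len(coeffs) == len(centers), "one coefficient per basis center"
        self.w_base = w_base
        self.coeffs = coeffs
        self.centers = centers
        self.width = width

    def __call__(self, x: float) -> float:
        basis = sum(c * math.exp(-((x - g) / self.width) ** 2)
                    for c, g in zip(self.coeffs, self.centers))
        return self.w_base * silu(x) + basis


def kan_conv3x3(x: Grid, edges: List[List[KANEdge]]) -> Grid:
    """Single-channel 3x3 'KAN convolution', stride 1, zero padding.

    A standard conv computes sum_ij w_ij * x_ij; here every tap (i, j)
    applies its own function edges[i][j] and the outputs are summed.
    """
    h, w = len(x), len(x[0])

    def px(r: int, c: int) -> float:
        # Zero padding outside the map; padded zeros still pass through phi.
        return x[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    return [[sum(edges[i][j](px(r + i - 1, c + j - 1))
                 for i in range(3) for j in range(3))
             for c in range(w)]
            for r in range(h)]


# Usage: the same (hypothetical) edge at every tap, 5 RBF centers on [-2, 2].
centers = [-2.0, -1.0, 0.0, 1.0, 2.0]
edges = [[KANEdge(0.1, [0.05, -0.02, 0.0, 0.03, 0.01], centers)
          for _ in range(3)] for _ in range(3)]
smoothed = kan_conv3x3([[0.5, -1.0], [2.0, 0.0]], edges)
```

Because each tap owns a learnable function rather than a scalar, the layer is non-linear in its inputs by construction; in a real model the per-edge coefficients would be trained along with the rest of the network.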
Findings
Achieved up to +2.0 AP improvement on the COCO dataset.
Identified the FPN's post-fusion smoothing step, rather than plug-and-play attention modules such as CBAM, as the key bottleneck in ViT-based pose estimation.
Provided a plug-and-play module that enhances multi-scale feature fusion.
Abstract
Vision Transformers (ViT) have demonstrated significant promise in dense prediction tasks such as pose estimation. However, their performance is frequently constrained by the overly simplistic front-end designs employed in models like ViTPose. This naive patchification mechanism struggles to effectively handle multi-scale variations and results in irreversible information loss during the initial feature extraction phase. To overcome this limitation, we introduce a novel KAN-enhanced FPN-Stem architecture. Through rigorous ablation studies, we first identified that the true bottleneck for performance improvement lies not in plug-and-play attention modules (e.g., CBAM), but in the post-fusion non-linear smoothing step within the FPN. Guided by this insight, our core innovation is to retain the classic "upsample-and-add" fusion stream of the FPN, but replace its terminal, standard linear…
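The abstract's central design choice, keeping the FPN's "upsample-and-add" fusion stream intact and changing only the terminal smoothing step, can be illustrated with a toy, single-channel sketch in plain Python. Assumptions: lateral 1x1 projections have already been applied, each pyramid level is exactly half the resolution of the level above it, upsampling is nearest-neighbour, and a fixed 3x3 mean filter (`box3x3`, hypothetical) stands in for the learned linear smoothing convolution. All function names are illustrative and not taken from the paper.

```python
from typing import Callable, List

Grid = List[List[float]]  # single-channel H x W feature map (toy setting)


def upsample2x_nearest(x: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling: repeat every value and every row twice."""
    return [[v for v in row for _ in range(2)] for row in x for _ in range(2)]


def add_maps(a: Grid, b: Grid) -> Grid:
    """Elementwise sum of two equally sized maps (the 'add' in upsample-and-add)."""
    assert len(a) == len(b) and len(a[0]) == len(b[0]), "shape mismatch"
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def box3x3(x: Grid) -> Grid:
    """Stand-in for a linear 3x3 smoothing conv: fixed mean filter, zero padding."""
    h, w = len(x), len(x[0])

    def px(r: int, c: int) -> float:
        return x[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    return [[sum(px(r + i, c + j) for i in (-1, 0, 1) for j in (-1, 0, 1)) / 9.0
             for c in range(w)]
            for r in range(h)]


def fpn_top_down(laterals: List[Grid], smooth: Callable[[Grid], Grid]) -> List[Grid]:
    """Top-down FPN fusion followed by a per-level smoothing step.

    `laterals` are lateral-projected maps ordered fine -> coarse, each level
    exactly half the resolution of the previous one. Returns the fused,
    smoothed maps in the same order.
    """
    merged = [laterals[-1]]  # start from the coarsest level
    for lat in reversed(laterals[:-1]):
        merged.append(add_maps(lat, upsample2x_nearest(merged[-1])))
    merged.reverse()  # back to fine -> coarse order
    return [smooth(m) for m in merged]  # the slot the KAN-based layer would occupy


# Usage: two levels (4x4 and 2x2) smoothed with the linear stand-in.
outputs = fpn_top_down([[[0.0] * 4 for _ in range(4)],
                        [[1.0, 2.0], [3.0, 4.0]]], box3x3)
```

Since the smoothing step is passed in as `smooth`, swapping the linear filter for a learnable non-linear layer leaves the fusion path untouched, which is consistent with the plug-and-play framing of the module.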
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Robotics and Sensor-Based Localization
