ViTNF: Leveraging Neural Fields to Boost Vision Transformers in Generalized Category Discovery
Jiayi Su, Dequan Jin

TL;DR
This paper introduces ViTNF, a novel neural field-based architecture replacing the MLP head in Vision Transformers, significantly improving generalized category discovery performance with reduced training complexity.
Contribution
The paper proposes a neural field-based classifier for Vision Transformers, simplifying training and enhancing accuracy in generalized category discovery tasks.
Findings
Outperforms state-of-the-art methods on CIFAR-100, ImageNet-100, CUB-200, and Stanford Cars.
Achieves accuracy improvements of 19% on new classes and 16% on all classes.
Reduces training sample requirements and training difficulty.
Abstract
Generalized category discovery (GCD) is a highly popular task in open-world recognition, aiming to identify unknown class samples using known class data. By leveraging pre-training, meta-training, and fine-tuning, ViT achieves excellent few-shot learning capabilities. Its MLP head is a feedforward network, trained synchronously with the entire network in the same process, increasing the training cost and difficulty without fully leveraging the power of the feature extractor. This paper proposes a new architecture by replacing the MLP head with a neural field-based one. We first present a new static neural field function to describe the activity distribution of the neural field and then use two static neural field functions to build an efficient few-shot classifier. This neural field-based (NF) classifier consists of two coupled static neural fields. It stores the feature information of…
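The abstract is cut off before it defines the static neural field function, so the paper's NF classifier cannot be reproduced from this page. As a rough illustration of the general idea only (swapping a jointly trained MLP head for a head that stores features and needs no gradient training), the sketch below substitutes a plain Gaussian kernel for the neural field. The names `gaussian_activity`, `KernelFieldHead`, `sigma`, and `novelty_threshold` are all hypothetical, and the feature vectors stand in for the outputs of a frozen ViT backbone.

```python
import math


def gaussian_activity(x, center, sigma=1.0):
    """Stand-in activity function: a Gaussian bump centred on a stored feature.

    This is an assumption for illustration; it is NOT the paper's
    static neural field function, which is not given in the abstract.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


class KernelFieldHead:
    """Hypothetical training-free head over frozen backbone features."""

    def __init__(self, sigma=1.0):
        self.sigma = sigma
        self.memory = []  # list of (feature, label) pairs from known classes

    def store(self, feature, label):
        """Memorize one labelled feature; no gradient step is involved."""
        self.memory.append((list(feature), label))

    def scores(self, feature):
        """Sum the activity contributed by each stored feature, per label."""
        totals = {}
        for center, label in self.memory:
            activity = gaussian_activity(feature, center, self.sigma)
            totals[label] = totals.get(label, 0.0) + activity
        return totals

    def predict(self, feature, novelty_threshold=0.1):
        """Return the best known label, or None when activity is too low.

        A None result flags the sample as a candidate for a new category.
        """
        totals = self.scores(feature)
        if not totals:
            return None
        label, best = max(totals.items(), key=lambda kv: kv[1])
        return label if best >= novelty_threshold else None


# Toy 2-D "features" in place of real ViT embeddings.
head = KernelFieldHead(sigma=0.5)
head.store([0.0, 0.0], "cat")
head.store([0.1, 0.0], "cat")
head.store([3.0, 3.0], "dog")

near_cat = head.predict([0.05, 0.0])
far_away = head.predict([10.0, 10.0])
```

In this toy setup, a query close to the stored "cat" features is assigned to that class, while a query far from everything in memory falls below the threshold and is returned as `None`, which is one simple way to flag a possible unknown class.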
Taxonomy
Topics: Neural Networks and Applications · Digital Imaging for Blood Diseases · Domain Adaptation and Few-Shot Learning
