Frequency-Adaptive Discrete Cosine-ViT-ResNet Architecture for Sparse-Data Vision

Ziyue Kang; Weichuan Zhang

arXiv:2505.22701·cs.CV·May 1, 2026


TL;DR

This paper introduces a hybrid deep-learning framework that combines adaptive frequency-domain DCT filtering, ViT-B16 and ResNet50 backbones, and a Bayesian linear classification head to improve rare-animal image classification when labeled data is scarce.

Contribution

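The authors present what is, to their knowledge, the first adaptive frequency-selection mechanism for DCT preprocessing: it learns low-, mid-, and high-frequency boundaries suited to the downstream deep backbones in sparse-data vision tasks.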

Findings

01 Outperforms conventional CNN and fixed-band DCT pipelines.

02 Achieves state-of-the-art accuracy on a 50-class wildlife dataset.

03 Effectively handles extreme sample scarcity.

Abstract

A major challenge in rare animal image classification is the scarcity of data, as many species usually have only a small number of labeled samples. To address this challenge, we designed a hybrid deep-learning framework comprising a novel adaptive DCT preprocessing module, ViT-B16 and ResNet50 backbones, and a Bayesian linear classification head. To our knowledge, we are the first to introduce an adaptive frequency-domain selection mechanism that learns optimal low-, mid-, and high-frequency boundaries suited to the subsequent backbones. Our network first captures image frequency-domain cues via this adaptive DCT partitioning. The adaptively filtered frequency features are then fed into ViT-B16 to model global contextual relationships, while ResNet50 concurrently extracts local, multi-scale spatial representations from the original image. A cross-level fusion strategy seamlessly…
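To make the idea of partitioning DCT coefficients into low-, mid-, and high-frequency bands concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names `dct_matrix` and `split_bands`, the normalized index `(u + v) / (2(n - 1))` used to rank frequencies, and the fixed boundaries `b_low` and `b_high` are all assumptions made for illustration. In the paper the boundaries are learned; here they are simply passed in as numbers.

```python
import numpy as np


def dct_matrix(n):
    """Return the orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row uses a different scale factor
    return m


def split_bands(image, b_low, b_high):
    """Split a square grayscale image into low/mid/high frequency parts.

    b_low and b_high are normalized boundaries in (0, 1). They are
    hypothetical stand-ins for the learned boundaries described in the
    abstract, not values taken from the paper.
    """
    n = image.shape[0]
    d = dct_matrix(n)
    coeffs = d @ image @ d.T  # forward 2D DCT

    # Rank each coefficient by a normalized frequency in [0, 1]
    # (an assumed ordering; other choices such as radial distance work too).
    u, v = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    r = (u + v) / (2.0 * (n - 1))

    masks = [r < b_low, (r >= b_low) & (r < b_high), r >= b_high]
    # Inverse 2D DCT of each masked coefficient set
    return [d.T @ (coeffs * m) @ d for m in masks]


rng = np.random.default_rng(0)
img = rng.random((8, 8))
low, mid, high = split_bands(img, b_low=0.2, b_high=0.6)
```

Because the three masks cover every coefficient exactly once and the transform is orthonormal, the three band images sum back to the input. A learnable version would replace the hard masks with differentiable (soft) ones so that the boundaries can receive gradients; how the paper does this is not stated in the truncated abstract.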
