Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding
Thinesh Thiyakesan Ponbagavathi, Constantin Seibold, Alina Roitberg

TL;DR
Frame2Freq introduces FFT-based spectral adapters that capture multi-scale temporal dynamics for fine-grained video understanding, outperforming prior PEFT methods on five datasets and fully fine-tuned models on four.
Contribution
The paper proposes a frequency-aware adapter that encodes spectral information during image-to-video transfer, strengthening temporal modeling in image-pretrained vision models.
Findings
Outperforms prior PEFT methods on five datasets
Surpasses fully fine-tuned models on four datasets
Demonstrates effectiveness of frequency analysis in temporal modeling
Abstract
Adapting image-pretrained backbones to video typically relies on time-domain adapters tuned to a single temporal scale. Our experiments show that these modules pick up static image cues and very fast flicker changes, while overlooking medium-speed motion. Capturing dynamics across multiple time scales is, however, crucial for fine-grained temporal analysis (e.g., opening vs. closing a bottle). To address this, we introduce Frame2Freq, a family of frequency-aware adapters that perform spectral encoding during image-to-video adaptation of pretrained Vision Foundation Models (VFMs), improving fine-grained action recognition. Frame2Freq applies the Fast Fourier Transform (FFT) along the time axis and learns frequency-band-specific embeddings that adaptively highlight the most discriminative frequency ranges. Across five fine-grained activity recognition datasets, Frame2Freq outperforms prior PEFT methods…
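To make the idea of an FFT along time with per-band weighting concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' architecture: the function names `band_masks` and `spectral_adapter`, the equal-width band split, the real-valued `band_gain` parameter, and the residual connection are all hypothetical choices made for this example.

```python
import numpy as np

def band_masks(num_bins, num_bands):
    """Split rFFT bin indices into contiguous bands.

    Returns a (num_bands, num_bins) array of 0/1 masks. The equal-width
    split is an assumption of this sketch.
    """
    edges = np.linspace(0, num_bins, num_bands + 1).astype(int)
    masks = np.zeros((num_bands, num_bins))
    for b in range(num_bands):
        masks[b, edges[b]:edges[b + 1]] = 1.0
    return masks

def spectral_adapter(x, band_gain):
    """Reweight temporal frequency bands of per-frame features.

    x:         (T, D) features, one row per frame.
    band_gain: (num_bands, D) hypothetical learnable gains, one per
               band and feature channel.
    """
    T = x.shape[0]
    spec = np.fft.rfft(x, axis=0)                # (T//2 + 1, D), complex
    masks = band_masks(spec.shape[0], band_gain.shape[0])
    gain = masks.T @ band_gain                   # (bins, D): gain per bin
    filtered = np.fft.irfft(spec * gain, n=T, axis=0)
    return x + filtered                          # residual connection

# Example: 8 frames, 4 channels, 3 bands (low / mid / high).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
gain = np.zeros((3, 4))
gain[1] = 1.0                                    # emphasize the mid band only
y = spectral_adapter(x, gain)
```

With all gains at zero the adapter reduces to the identity, so the frozen backbone's features pass through unchanged; raising the gain of one band adds back only the motion content at those temporal frequencies.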
Taxonomy
Topics: Human Pose and Action Recognition · Gait Recognition and Analysis · Domain Adaptation and Few-Shot Learning
