TL;DR
This paper introduces the Spectral-dEcomposed Token (SET) framework, which improves domain generalization in semantic segmentation by decomposing frozen Vision Foundation Model (VFM) features into amplitude (mainly style) and phase (mainly content) components and learning style-invariant features from them.
Contribution
The SET framework decomposes frozen VFM features into phase and amplitude components in frequency space and uses an attention-based method to improve style-invariant feature extraction for cross-domain segmentation.
Findings
Achieves state-of-the-art results on cross-domain semantic segmentation tasks.
Effectively separates style and content information in frequency space.
Enhances style-invariant feature learning through attention optimization.
Abstract
The rapid development of Vision Foundation Models (VFMs) brings inherent out-of-domain generalization to a variety of downstream tasks. Among them, domain generalized semantic segmentation (DGSS) poses unique challenges, as cross-domain images share common pixel-wise content information but vary greatly in style. In this paper, we present a novel Spectral-dEcomposed Token (SET) learning framework to advance the frontier. Going beyond the existing paradigm of fine-tuning tokens on a frozen backbone, the proposed SET focuses on how to learn style-invariant features from these learnable tokens. In particular, the frozen VFM features are first decomposed into phase and amplitude components in frequency space, which mainly contain content and style information, respectively, and are then separately processed by learnable tokens for task-specific…
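The phase/amplitude split mentioned in the abstract can be illustrated with a plain 2-D Fourier transform. The following is a minimal NumPy sketch, not the authors' implementation: the function names `spectral_decompose` and `spectral_recompose`, the `(C, H, W)` feature shape, and the random stand-in features are all assumptions made for illustration, and the learnable tokens and attention step are not shown.

```python
import numpy as np


def spectral_decompose(feat):
    """Split a (C, H, W) feature map into amplitude and phase with a 2-D FFT.

    Hypothetical helper: per the abstract, amplitude mainly carries style
    and phase mainly carries content.
    """
    spectrum = np.fft.fft2(feat, axes=(-2, -1))
    return np.abs(spectrum), np.angle(spectrum)


def spectral_recompose(amplitude, phase):
    """Invert the split: rebuild the complex spectrum, then apply the inverse FFT."""
    spectrum = amplitude * np.exp(1j * phase)
    return np.fft.ifft2(spectrum, axes=(-2, -1)).real


# Random array standing in for a frozen VFM feature map (assumed shape).
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))

amplitude, phase = spectral_decompose(feat)
restored = spectral_recompose(amplitude, phase)

# The decomposition is lossless up to floating-point error.
round_trip_ok = np.allclose(feat, restored)
```

Because the two components can be recombined without loss, each branch could in principle be processed separately and merged afterwards; how SET actually does this is not specified in the excerpt above.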
Taxonomy
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training
