Diffusion-Enhanced Test-time Adaptation with Text and Image Augmentation
Chun-Mei Feng, Yuanyang He, Jian Zou, Salman Khan, Huan Xiong, Zhen Li, Wangmeng Zuo, Rick Siow Mong Goh, Yong Liu

TL;DR
This paper introduces IT3A, a multi-modal test-time adaptation method that uses generative models for data augmentation across text and images, significantly improving accuracy under distribution shifts.
Contribution
IT3A leverages pre-trained vision and language models for multi-modal augmentation and employs cosine similarity filtering, offering a novel approach to test-time adaptation beyond single-modality methods.
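The cosine similarity filtering mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say which embeddings are compared or what cutoff is used, so the choice of the test sample's embedding as the anchor, the function names `cosine_similarity` and `filter_augmentations`, and the `threshold` value are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def filter_augmentations(anchor, candidates, threshold=0.8):
    """Return indices of candidate embeddings that stay close to the anchor.

    anchor:     embedding of the original test sample (assumed anchor choice)
    candidates: embeddings of generated image or text augmentations
    threshold:  hypothetical cutoff; augmentations below it are discarded
    """
    kept = []
    for index, candidate in enumerate(candidates):
        if cosine_similarity(anchor, candidate) >= threshold:
            kept.append(index)
    return kept


# Toy 2-D embeddings: the second candidate is orthogonal to the anchor,
# so it is treated as a spurious augmentation and dropped.
anchor = [1.0, 0.0]
candidates = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
kept = filter_augmentations(anchor, candidates, threshold=0.7)
```

In practice the embeddings would come from the pre-trained vision and language encoders, and the threshold would need tuning per dataset; the toy vectors here only show the mechanics of discarding augmentations that drift away from the original sample.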
Findings
Outperforms state-of-the-art TPT methods by 5.50% in accuracy.
Effectively filters spurious augmentations using cosine similarity.
Enhances model robustness to distribution shifts and domain gaps.
Abstract
Existing test-time prompt tuning (TPT) methods focus on single-modality data, primarily enhancing images and using confidence ratings to filter out inaccurate images. However, while image generation models can produce visually diverse images, single-modality data enhancement techniques still fail to capture the comprehensive knowledge provided by different modalities. Additionally, we note that the performance of TPT-based methods drops significantly when the number of augmented images is limited, which is not unusual given the computational expense of generative augmentation. To address these issues, we introduce IT3A, a novel test-time adaptation method that utilizes a pre-trained generative model for multi-modal augmentation of each test sample from unknown new domains. By combining augmented data from pre-trained vision and language models, we enhance the ability of the model to…
Taxonomy
Topics: Subtitles and Audiovisual Media · Advanced Vision and Imaging · Advanced Image Processing Techniques
Methods: Focus · Adapter
