VACoT: Rethinking Visual Data Augmentation with VLMs
Zhengzhuo Xu, Chong Sun, SiNan Du, Chen Li, Jing Lyu, Chun Yuan

TL;DR
VACoT introduces a dynamic visual augmentation framework that makes visual language models (VLMs) more robust at inference time through structured, post-hoc image transformations, significantly improving performance on challenging perception tasks.
Contribution
The paper proposes VACoT, a novel inference-time augmentation method that uses reinforcement learning to improve VLM robustness with minimal training overhead.
Findings
VACoT improves robustness on out-of-distribution inputs.
VACoT enhances OCR performance in adversarial scenarios.
Extensive experiments validate VACoT's effectiveness across 13 benchmarks.
Abstract
While visual data augmentation remains a cornerstone for training robust vision models, it has received limited attention in visual language models (VLMs), which predominantly rely on large-scale real data acquisition or synthetic diversity. Consequently, they may struggle with basic perception tasks that conventional models handle reliably. Given the substantial cost of pre-training and fine-tuning VLMs, continued training on augmented data yields limited and diminishing returns. In this paper, we present Visual Augmentation Chain-of-Thought (VACoT), a framework that dynamically invokes image augmentations during model inference. By incorporating post-hoc transformations such as denoising, VACoT substantially improves robustness on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios. Distinct from prior approaches limited to local cropping, VACoT…
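The idea of invoking an augmentation during inference, rather than during training, can be illustrated with a toy sketch. Nothing below is taken from the paper's implementation: `toy_policy` is a hand-written heuristic standing in for the VLM's own decision to call a transformation, and `AUGMENTATIONS`, `median_denoise`, and `run_with_augmentations` are hypothetical names chosen for illustration. Images are plain nested lists of grayscale values so the example has no dependencies.

```python
from statistics import median
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]  # grayscale pixels in 0..255 (illustrative format)


def median_denoise(img: Image) -> Image:
    """3x3 median filter; border pixels use only the neighbours that exist."""
    h, w = len(img), len(img[0])
    out: Image = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(int(median(window)))
        out.append(row)
    return out


# Hypothetical tool registry: the transformations the model may invoke.
AUGMENTATIONS: Dict[str, Callable[[Image], Image]] = {
    "denoise": median_denoise,
}


def toy_policy(img: Image, history: List[str]) -> str:
    """Stand-in for the model's decision (assumption, not the paper's method).

    Requests "denoise" once if any pixel differs sharply from its local
    median; otherwise returns "answer" to stop augmenting.
    """
    if "denoise" in history:
        return "answer"
    smoothed = median_denoise(img)
    noisy = any(
        abs(p - q) > 64
        for row, srow in zip(img, smoothed)
        for p, q in zip(row, srow)
    )
    return "denoise" if noisy else "answer"


def run_with_augmentations(
    img: Image,
    policy: Callable[[Image, List[str]], str] = toy_policy,
    max_steps: int = 3,
) -> Tuple[Image, List[str]]:
    """Apply policy-chosen transformations until the policy says "answer"."""
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(img, history)
        if action == "answer":
            break
        img = AUGMENTATIONS[action](img)
        history.append(action)
    return img, history
```

In this sketch the image handed to the final answering step is whatever remains after the chosen transformations, and `history` records which ones were invoked. A real system would replace `toy_policy` with the model's own output and would offer more transformations than the single one registered here.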
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
