VACoT: Rethinking Visual Data Augmentation with VLMs

Zhengzhuo Xu; Chong Sun; SiNan Du; Chen Li; Jing Lyu; Chun Yuan

arXiv:2512.02361·cs.CV·December 3, 2025

Open Access

TL;DR

VACoT is a framework that dynamically invokes image augmentations during inference, applying post-hoc transformations such as denoising to make visual language models (VLMs) more robust on challenging perception tasks.

Contribution

The paper proposes VACoT, an inference-time augmentation method that uses reinforcement learning to improve VLM robustness with minimal training overhead.

Findings

01. VACoT improves robustness on out-of-distribution inputs.

02. VACoT enhances OCR performance in adversarial scenarios.

03. Extensive experiments validate VACoT's effectiveness across 13 benchmarks.

Abstract

While visual data augmentation remains a cornerstone of training robust vision models, it has received limited attention in visual language models (VLMs), which predominantly rely on large-scale real data acquisition or synthetic diversity. Consequently, VLMs may struggle with basic perception tasks that conventional models handle reliably. Given the substantial cost of pre-training and fine-tuning VLMs, continued training on augmented data yields limited and diminishing returns. In this paper, we present Visual Augmentation Chain-of-Thought (VACoT), a framework that dynamically invokes image augmentations during model inference. By incorporating post-hoc transformations such as denoising, VACoT substantially improves robustness on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios. Distinct from prior approaches limited to local cropping, VACoT…
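
To make the idea of invoking a post-hoc transformation at inference time concrete, here is a minimal sketch. It is not the authors' implementation. The tool registry `AUGMENTATIONS`, the helper `apply_chain`, the 3x3 median filter chosen as the denoiser, and the `rotate_180` tool are all assumptions made for illustration; the abstract names only denoising as an example transformation. How VACoT decides which transformations to invoke is not shown here: the list of tool names is simply passed in by the caller.

```python
from statistics import median
from typing import Callable, Dict, List

Image = List[List[int]]  # grayscale image as rows of 0-255 pixel values


def median_denoise(img: Image) -> Image:
    """3x3 median filter; border pixels use only the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(int(median(window)))
        out.append(row)
    return out


def rotate_180(img: Image) -> Image:
    """Hypothetical second tool, included only to show chaining."""
    return [row[::-1] for row in img[::-1]]


# Hypothetical registry of transformations that can be invoked by name.
AUGMENTATIONS: Dict[str, Callable[[Image], Image]] = {
    "denoise": median_denoise,
    "rotate_180": rotate_180,
}


def apply_chain(img: Image, ops: List[str]) -> Image:
    """Apply the named transformations in order before the image is re-read."""
    for name in ops:
        img = AUGMENTATIONS[name](img)
    return img


# A flat patch with one salt-noise pixel in the centre.
noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
cleaned = apply_chain(noisy, ["denoise"])
```

In a full system, the cleaned image would be handed back to the model for a second look at the input. That step is omitted here because the truncated abstract does not describe it.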

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis