Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation
Sangmim Song, Sarath Kodagoda, Marc Carmichael, Karthick Thiyagarajan

TL;DR
This paper introduces Concept-Gated Visual Distillation (CGVD), a training-free inference framework that makes Vision-Language-Action (VLA) models more robust in cluttered environments by suppressing distractors and stabilizing manipulation policies.
Contribution
The paper presents a training-free, model-agnostic method that improves VLA model performance in cluttered scenes through three stages: scene parsing, target refinement, and Fourier-based inpainting.
Findings
CGVD prevents performance collapse in cluttered environments.
CGVD achieves a 77.5% success rate in dense-distractor scenarios, outperforming the baseline.
CGVD enforces strict attribute adherence for robust robotic manipulation.
Abstract
Vision-Language-Action (VLA) models demonstrate impressive zero-shot generalization but frequently suffer from a "Precision-Reasoning Gap" in cluttered environments. This failure is driven by background-induced feature dilution, where high-frequency semantic noise corrupts the geometric grounding required for precise manipulation. To bridge this gap, we propose Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that stabilizes VLA policies. CGVD operates by parsing instructions into safe and distractor sets, utilizing a two-layer target refinement process (combining cross-validation and spatial disambiguation) to explicitly penalize false positives and isolate genuine manipulation targets. We then process the scene via Fourier-based inpainting, generating a clean observation that actively suppresses semantic distractors while preserving…
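The abstract does not spell out how the safe/distractor split or the Fourier-based inpainting is implemented, so the following is only a rough sketch under stated assumptions, not the authors' method. It assumes object labels and a per-pixel distractor mask are already available from some upstream detector, uses plain word matching to split labels, and stands in a simple iterative low-pass Fourier projection for the inpainting step. The names `split_concepts` and `lowpass_fft_inpaint`, and the parameters `keep_frac` and `iters`, are hypothetical.

```python
import numpy as np


def split_concepts(instruction, detected_labels):
    """Split detected labels into a safe set (named in the instruction)
    and a distractor set (everything else).

    Assumption: plain word matching stands in for the paper's instruction
    parsing, which is not described in the abstract.
    """
    words = set(instruction.lower().replace(",", " ").split())
    safe = [lab for lab in detected_labels if lab.lower() in words]
    distractors = [lab for lab in detected_labels if lab.lower() not in words]
    return safe, distractors


def lowpass_fft_inpaint(image, mask, keep_frac=0.1, iters=50):
    """Fill masked pixels of a 2-D grayscale image with low-frequency content.

    image: 2-D float array.
    mask:  2-D bool array, True where distractor pixels should be replaced.

    Assumption: an iterative low-pass projection, used here as a generic
    stand-in for "Fourier-based inpainting".
    """
    image = np.asarray(image, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    h, w = image.shape

    # Boolean low-pass filter over normalized frequencies in [-0.5, 0.5).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    lowpass = (np.abs(fy) <= keep_frac / 2) & (np.abs(fx) <= keep_frac / 2)

    # Start masked pixels at the mean of the known pixels.
    out = image.copy()
    out[mask] = image[~mask].mean()

    for _ in range(iters):
        # Project onto low frequencies, then restore the known pixels by
        # overwriting only the masked region.
        smooth = np.fft.ifft2(np.fft.fft2(out) * lowpass).real
        out[mask] = smooth[mask]
    return out


# Toy usage: a flat background with one bright "distractor" patch.
safe, distractors = split_concepts(
    "pick up the spoon and place it on the towel",
    ["spoon", "towel", "fork", "mug"],
)
scene = np.full((32, 32), 0.5)
scene[10:16, 10:16] = 1.0
distractor_mask = np.zeros((32, 32), dtype=bool)
distractor_mask[10:16, 10:16] = True
clean = lowpass_fft_inpaint(scene, distractor_mask)
```

In this toy case the distractor patch is replaced by background-like values while every pixel outside the mask is left untouched, which mirrors the stated goal of suppressing distractors without disturbing the rest of the observation.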
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning
