Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation

Sangmim Song; Sarath Kodagoda; Marc Carmichael; Karthick Thiyagarajan

arXiv:2603.10340 · cs.CV · March 12, 2026

TL;DR

This paper introduces Concept-Gated Visual Distillation (CGVD), an inference-time framework that makes Vision-Language-Action (VLA) models more robust in cluttered environments by suppressing distractors and stabilizing manipulation policies.

Contribution

The paper presents a training-free, model-agnostic method that improves VLA model performance in cluttered scenes by parsing instructions into safe and distractor sets, refining targets in two layers, and applying Fourier-based inpainting.

Findings

01. CGVD prevents performance collapse in cluttered environments.

02. CGVD achieves a 77.5% success rate in dense-distractor scenarios, outperforming the baseline.

03. CGVD enforces strict attribute adherence for robust robotic manipulation.

Abstract

Vision-Language-Action (VLA) models demonstrate impressive zero-shot generalization but frequently suffer from a "Precision-Reasoning Gap" in cluttered environments. This failure is driven by background-induced feature dilution, where high-frequency semantic noise corrupts the geometric grounding required for precise manipulation. To bridge this gap, we propose Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that stabilizes VLA policies. CGVD operates by parsing instructions into safe and distractor sets, utilizing a two-layer target refinement process (combining cross-validation and spatial disambiguation) to explicitly penalize false positives and isolate genuine manipulation targets. We then process the scene via Fourier-based inpainting, generating a clean observation that actively suppresses semantic distractors while preserving…
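
The first step the abstract describes, splitting scene concepts into a safe set and a distractor set based on the instruction, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gate_concepts`, the word-matching rule, and the example labels are all assumptions made for illustration, and the abstract does not say how the sets are actually derived. The refinement and Fourier-based inpainting stages are not sketched here.

```python
import re


def gate_concepts(instruction: str, detected_labels: list[str]) -> tuple[set[str], set[str]]:
    """Split detected object labels into (safe, distractor) sets.

    Assumption (not from the paper): a label is "safe" only if every word
    in the label also appears in the instruction; everything else is
    treated as a distractor.
    """
    instruction_words = set(re.findall(r"[a-z]+", instruction.lower()))
    safe: set[str] = set()
    distractors: set[str] = set()
    for label in detected_labels:
        label_words = re.findall(r"[a-z]+", label.lower())
        # Requiring all words to match means "blue cup" is not accepted
        # when the instruction asks for a "red cup".
        if label_words and all(word in instruction_words for word in label_words):
            safe.add(label)
        else:
            distractors.add(label)
    return safe, distractors


# Hypothetical instruction and detector output, for illustration only.
safe, distractors = gate_concepts(
    "put the spoon on the towel",
    ["spoon", "towel", "fork", "banana"],
)
print("safe:", sorted(safe))
print("distractors:", sorted(distractors))
```

Under this assumed rule, attribute words matter: for the instruction "pick up the red cup", a detected "blue cup" falls into the distractor set because "blue" does not appear in the instruction.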

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning