AutoFocus-IL: VLM-based Saliency Maps for Data-Efficient Visual Imitation Learning without Extra Human Annotations

Litian Gong; Fatemeh Bahrani; Yutai Zhou; Amin Banayeeanzade; Jiachen Li; Erdem Bıyık

arXiv:2511.18617 · cs.RO · November 26, 2025

TL;DR

AutoFocus-IL uses vision-language models to automatically generate saliency maps that improve data efficiency and generalization in visual imitation learning without requiring costly human annotations.

Contribution

The paper introduces a VLM-based saliency regularization method that automatically identifies task-relevant features and uses them to improve imitation learning performance.

Findings

1. Outperforms standard behavior cloning in simulation and real-robot tasks.

2. Surpasses state-of-the-art methods requiring human supervision.

3. Improves focus on task-relevant cues, reducing distractor influence.

Abstract

AutoFocus-IL is a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Although saliency regularization has emerged as a promising way to achieve this, existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that…
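The pipeline described in the abstract (key objects located in each frame, turned into a saliency map, then used to regularize behavior cloning) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the VLM and tracker are replaced by hard-coded bounding boxes, the saliency map is built from Gaussian bumps, the regularizer is a KL divergence between the target map and a policy attention map, and the names `boxes_to_saliency`, `kl_divergence`, `regularized_bc_loss`, `sigma_scale`, and `weight` are hypothetical.

```python
import math


def boxes_to_saliency(boxes, height, width, sigma_scale=0.5):
    """Build a normalized saliency map from (x0, y0, x1, y1) boxes.

    Assumption: each key object contributes a Gaussian bump centered on
    its box. The returned map is a list of rows that sums to 1.
    """
    saliency = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        sx = max((x1 - x0) * sigma_scale, 1.0)
        sy = max((y1 - y0) * sigma_scale, 1.0)
        for y in range(height):
            for x in range(width):
                dist = ((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2
                saliency[y][x] += math.exp(-0.5 * dist)
    total = sum(map(sum, saliency))
    if total == 0.0:
        # No key objects in this frame: fall back to a uniform map.
        uniform = 1.0 / (height * width)
        return [[uniform] * width for _ in range(height)]
    return [[value / total for value in row] for row in saliency]


def kl_divergence(target, attention, eps=1e-8):
    """KL(target || attention) over two maps of the same shape."""
    return sum(
        t * math.log((t + eps) / (a + eps))
        for t_row, a_row in zip(target, attention)
        for t, a in zip(t_row, a_row)
    )


def regularized_bc_loss(pred_action, expert_action, attention, target, weight=0.1):
    """Behavior cloning MSE plus a weighted saliency-alignment penalty.

    Assumption: the policy exposes an attention map that sums to 1.
    """
    bc = sum((p - e) ** 2 for p, e in zip(pred_action, expert_action))
    bc /= len(expert_action)
    return bc + weight * kl_divergence(target, attention)


# Hypothetical frame: one tracked object in an 8x8 image.
target = boxes_to_saliency([(2, 2, 4, 4)], height=8, width=8)
uniform_attention = [[1.0 / 64] * 8 for _ in range(8)]
loss_aligned = regularized_bc_loss([0.5], [0.5], target, target)
loss_unfocused = regularized_bc_loss([0.5], [0.5], uniform_attention, target)
```

In this sketch, a policy whose attention matches the target map pays no penalty, while one that spreads attention uniformly over the image is penalized even when its predicted action is correct. A real implementation would use an array library and derive the attention map from the policy network.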

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Social Robot Interaction and HRI