TL;DR
Instruct2See is a zero-shot framework that removes diverse obstructions from images, including obstacles not seen during training. It casts obstruction removal as a unified mask restoration task guided by multi-modal prompts and dynamic mask adaptation.
Contribution
It introduces a novel unified approach for obstruction removal that handles both seen and unseen obstacles using multi-modal prompts and a tunable mask adapter.
Findings
Achieves strong generalization to out-of-distribution obstacles
Performs well on both in-distribution and out-of-distribution data
Demonstrates effective real-time mask adjustment capabilities
Abstract
Images are often obstructed by various obstacles due to capture limitations, hindering the observation of objects of interest. Most existing methods address occlusions from specific elements like fences or raindrops, but are constrained by the wide range of real-world obstructions, making comprehensive data collection impractical. To overcome these challenges, we propose Instruct2See, a novel zero-shot framework capable of handling both seen and unseen obstacles. The core idea of our approach is to unify obstruction removal by treating it as a soft-hard mask restoration problem, where any obstruction can be represented using multi-modal prompts, such as visual semantics and textual instructions, processed through a cross-attention unit to enhance contextual understanding and improve mode control. Additionally, a tunable mask adapter allows for dynamic soft masking, enabling real-time…
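The abstract's core idea is to treat obstruction removal as soft-hard mask restoration, with a mask that can be tuned. A minimal NumPy sketch can illustrate that idea. It is not the authors' implementation. The per-pixel obstruction probability map, the `threshold` and `softness` knobs, and the helper names `soft_hard_mask` and `apply_mask` are all hypothetical. The paper's tunable mask adapter is a learned component, not a fixed linear blend. The sketch only shows how a binary (hard) mask and a continuous (soft) mask can be interpolated through one adjustable parameter before masked pixels are passed to a restoration model.

```python
import numpy as np


def soft_hard_mask(prob: np.ndarray, threshold: float = 0.5, softness: float = 0.0) -> np.ndarray:
    """Blend a hard (binary) and a soft (continuous) obstruction mask.

    Hypothetical illustration only, not Instruct2See's learned mask adapter.
    prob:      per-pixel obstruction probability, shape (H, W), clipped to [0, 1].
    threshold: pixels with prob >= threshold count as fully obstructed in the hard mask.
    softness:  0.0 gives the pure hard mask, 1.0 gives the pure soft mask.
    """
    if not 0.0 <= softness <= 1.0:
        raise ValueError("softness must lie in [0, 1]")
    prob = np.clip(prob, 0.0, 1.0)
    hard = (prob >= threshold).astype(np.float64)
    # Linear interpolation between the binary mask and the continuous mask.
    return (1.0 - softness) * hard + softness * prob


def apply_mask(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Attenuate obstructed pixels so a restoration model can fill them in.

    image: (H, W, C) array; mask: (H, W) array in [0, 1], where 1 = fully obstructed.
    """
    # Broadcast the single-channel mask over the color channels.
    return image * (1.0 - mask[..., None])


if __name__ == "__main__":
    prob = np.array([[0.2, 0.8], [0.6, 0.1]])
    image = np.ones((2, 2, 3))
    for s in (0.0, 0.5, 1.0):
        m = soft_hard_mask(prob, threshold=0.5, softness=s)
        print(f"softness={s}:\n{m}")
        print(apply_mask(image, m)[..., 0])
```

Setting `softness` to 0 yields a purely binary mask that discards every flagged pixel. Values toward 1 let weakly flagged, semi-transparent regions keep part of the underlying signal. Exposing this as a single scalar is one simple way a mask could be adjusted at inference time without retraining.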