Steerable Visual Representations
Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano

TL;DR
This paper introduces steerable visual representations: image features that natural language can direct toward specific concepts in an image, including less salient ones, while remaining useful for generic visual tasks.
Contribution
It proposes a method that injects text directly into the layers of the visual encoder (early fusion), making the resulting representations steerable and improving zero-shot generalization.
Findings
The steerable features can be directed to focus on any desired object in an image.
The approach outperforms or matches dedicated methods on anomaly detection.
It maintains representation quality while being steerable by natural language.
Abstract
Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion)…
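To make the late-fusion versus early-fusion distinction concrete, here is a minimal toy sketch in plain Python. It is not the paper's architecture: the function names (`cross_attend`, `late_fusion_features`, `early_fusion_features`), the single-head cross-attention without learned projections, the mean pooling, and the random token vectors are all assumptions chosen for illustration. It only shows that when text is mixed in inside the encoder layers, the visual features themselves depend on the prompt, whereas a visual feature computed before any fusion does not.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens):
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def cross_attend(patches, text_tokens):
    """Hypothetical layer: each patch attends to the text tokens (residual add)."""
    scale = math.sqrt(len(text_tokens[0]))
    out = []
    for p in patches:
        weights = softmax([dot(p, t) / scale for t in text_tokens])
        ctx = [sum(w * t[i] for w, t in zip(weights, text_tokens))
               for i in range(len(p))]
        out.append([pi + ci for pi, ci in zip(p, ctx)])
    return out


def late_fusion_features(patches, text_tokens):
    # Late fusion (toy): the visual feature is computed without seeing the text;
    # text would only be compared against it afterwards.
    return mean_pool(patches)


def early_fusion_features(patches, text_tokens, num_layers=2):
    # Early fusion (toy): text is injected inside every encoder layer,
    # so the pooled visual feature changes with the prompt.
    x = patches
    for _ in range(num_layers):
        x = cross_attend(x, text_tokens)
    return mean_pool(x)


# Random stand-ins for patch tokens and two different text prompts (assumed data).
rng = random.Random(0)
dim = 8
patches = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
prompt_a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
prompt_b = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

same_late = late_fusion_features(patches, prompt_a) == late_fusion_features(patches, prompt_b)
same_early = early_fusion_features(patches, prompt_a) == early_fusion_features(patches, prompt_b)
print("late-fusion features identical across prompts:", same_late)
print("early-fusion features identical across prompts:", same_early)
```

A real encoder would add self-attention, learned query/key/value projections, and normalization; those are omitted here so the dependence on the prompt is the only moving part.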
