Talk2SAM: Text-Guided Semantic Enhancement for Complex-Shaped Object Segmentation
Luka Vetoshkin, Dmitry Yudin

TL;DR
Talk2SAM introduces a text-guided method that enhances segmentation of complex-shaped objects by integrating semantic information from user prompts into existing models, significantly improving accuracy and user control.
Contribution
The paper presents a novel approach that combines CLIP and DINO features to improve segmentation of challenging objects and enables user-controllable segmentation through textual prompts.
Findings
Achieves up to +5.9% IoU improvement over SAM-HQ.
Enhances segmentation of thin and complex structures.
Provides flexible, user-controllable segmentation with natural language prompts.
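The IoU figure in the findings refers to intersection over union, the standard overlap measure between a predicted mask and a ground-truth mask. The sketch below shows how that score is computed on a toy binary mask. The function name `mask_iou`, the nested-list mask format, and the empty-mask convention are assumptions made for illustration. The paper's actual evaluation protocol is not described on this page.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks.

    Both masks are equal-sized nested lists of 0/1 values (an
    illustrative format, not the one used by the paper).
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Toy example: a one-pixel-wide diagonal "wire" and a prediction
# that misses its last pixel.
target = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
pred = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
iou = mask_iou(pred, target)  # 2 shared pixels / 3 pixels in the union
```

For thin structures a single missed pixel removes a large share of the object, which is why IoU is sensitive to exactly the kind of errors the findings describe.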
Abstract
Segmenting objects with complex shapes, such as wires, bicycles, or structural grids, remains a significant challenge for current segmentation models, including the Segment Anything Model (SAM) and its high-quality variant SAM-HQ. These models often struggle with thin structures and fine boundaries, leading to poor segmentation quality. We propose Talk2SAM, a novel approach that integrates textual guidance to improve segmentation of such challenging objects. The method uses CLIP-based embeddings derived from user-provided text prompts to identify relevant semantic regions, which are then projected into the DINO feature space. These features serve as additional prompts for SAM-HQ, enhancing its ability to focus on the target object. Beyond improving segmentation accuracy, Talk2SAM allows user-controllable segmentation, enabling disambiguation of objects within a single bounding box based…
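The text-guided step described in the abstract can be pictured with a small sketch: score each image patch against a text embedding, turn the scores into weights, pool the patches, and project the result into another feature space. This is only an illustration under stated assumptions, not the authors' implementation. The toy vectors stand in for real CLIP embeddings, and `text_guided_prompt`, `proj`, the softmax pooling, and the `temperature` value are all hypothetical choices, since the page does not specify how Talk2SAM performs the projection.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_guided_prompt(text_emb, patch_embs, proj, temperature=0.1):
    """Hypothetical helper: build one prompt vector from a text embedding.

    text_emb   -- text embedding (toy stand-in for a CLIP text feature)
    patch_embs -- per-patch image embeddings in the same space
    proj       -- assumed linear map (rows = output dims) into a second
                  feature space, standing in for the DINO projection
    """
    # 1. Relevance of each patch to the text prompt.
    sims = [cosine(text_emb, patch) for patch in patch_embs]
    weights = softmax(sims, temperature)
    # 2. Relevance-weighted average of the patch embeddings.
    dim = len(text_emb)
    pooled = [
        sum(w * patch[i] for w, patch in zip(weights, patch_embs))
        for i in range(dim)
    ]
    # 3. Linear projection into the target feature space.
    prompt = [sum(row[i] * pooled[i] for i in range(dim)) for row in proj]
    return weights, prompt


# Toy data: three 2-D patch embeddings and a 2-D -> 3-D projection.
text_emb = [1.0, 0.0]
patch_embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, prompt = text_guided_prompt(text_emb, patch_embs, proj)
```

In this toy setup the patches most aligned with the text embedding dominate the pooled vector, which is the intuition behind using text to steer the segmenter toward one object inside a bounding box.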
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Big Data and Digital Economy
Methods: Linear Layer · Softmax · Attention Is All You Need · Multi-Head Attention · Dense Connections · Residual Connection · Layer Normalization · Vision Transformer · Focus · self-DIstillation with NO labels
