Talk2SAM: Text-Guided Semantic Enhancement for Complex-Shaped Object Segmentation

Luka Vetoshkin; Dmitry Yudin

arXiv:2506.05396 · cs.CV · June 9, 2025

Open Access

TL;DR

Talk2SAM is a text-guided method that improves segmentation of complex-shaped objects by feeding semantic information from user text prompts into SAM-HQ as additional prompts, raising accuracy and letting the user control which object is segmented.

Contribution

The paper combines CLIP text embeddings with DINO features to improve segmentation of thin and complex-shaped objects, and enables user-controllable segmentation from textual prompts.

Findings

01. Achieves up to +5.9% IoU improvement over SAM-HQ.

02. Enhances segmentation of thin and complex structures.

03. Provides flexible, user-controllable segmentation with natural language prompts.
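
For readers unfamiliar with the metric behind the first finding, the following is a minimal sketch of intersection-over-union (IoU) for two binary masks. The function name `mask_iou` and the empty-mask convention are assumptions made for illustration, not taken from the paper, and the summary does not say whether the reported gain is in absolute points or relative terms.

```python
import numpy as np


def mask_iou(pred, target):
    """Intersection-over-union of two binary masks of the same shape.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")

    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


# Toy example: a thin vertical structure, predicted one column too wide.
target = np.zeros((4, 4), dtype=bool)
target[:, 1] = True            # 4 pixels
pred = np.zeros((4, 4), dtype=bool)
pred[:, 1:3] = True            # 8 pixels, 4 of them correct
print(mask_iou(pred, target))  # intersection 4, union 8
```

Thin structures are penalized heavily by this metric: a prediction only one pixel too wide already halves the score in the toy example, which is one reason wires and grids are hard cases.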

Abstract

Segmenting objects with complex shapes, such as wires, bicycles, or structural grids, remains a significant challenge for current segmentation models, including the Segment Anything Model (SAM) and its high-quality variant SAM-HQ. These models often struggle with thin structures and fine boundaries, leading to poor segmentation quality. We propose Talk2SAM, a novel approach that integrates textual guidance to improve segmentation of such challenging objects. The method uses CLIP-based embeddings derived from user-provided text prompts to identify relevant semantic regions, which are then projected into the DINO feature space. These features serve as additional prompts for SAM-HQ, enhancing its ability to focus on the target object. Beyond improving segmentation accuracy, Talk2SAM allows user-controllable segmentation, enabling disambiguation of objects within a single bounding box based…
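
The data flow described in the abstract can be pictured with a small NumPy sketch. Everything here is an assumption made for illustration: the function name `text_guided_prompt`, the softmax pooling, the temperature value, and the random projection matrix are hypothetical stand-ins, and real CLIP, DINO, and SAM-HQ models are replaced by random arrays. It is not the authors' implementation.

```python
import numpy as np


def text_guided_prompt(patch_feats, text_emb, proj, temperature=0.07):
    """Turn a text embedding into one extra prompt vector (illustrative only).

    patch_feats: (N, D_clip) CLIP-like features, one row per image patch.
    text_emb:    (D_clip,)   CLIP-like embedding of the user's text prompt.
    proj:        (D_clip, D_dino) hypothetical linear map into a DINO-sized space.
    """
    # Cosine similarity between every patch and the text prompt.
    patches = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb)
    similarity = patches @ text                      # shape (N,)

    # Softmax over patches: which regions match the text best.
    logits = similarity / temperature
    logits = logits - logits.max()                   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()

    # Pool the text-relevant patches, then project into the target space.
    pooled = weights @ patch_feats                   # shape (D_clip,)
    prompt = pooled @ proj                           # shape (D_dino,)
    return weights, prompt


# Stand-in data: 16 patches, 8-dim "CLIP" space, 12-dim "DINO" space.
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(16, 8))
text_emb = patch_feats[3].copy()     # pretend the text describes patch 3
proj = rng.normal(size=(8, 12))

weights, prompt = text_guided_prompt(patch_feats, text_emb, proj)
print(weights.argmax(), prompt.shape)
```

In the method as the abstract describes it, a vector like `prompt` would be passed to SAM-HQ alongside the usual box or point prompts; how that is done in practice is not specified in this summary.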

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Big Data and Digital Economy

Methods: Linear Layer · Softmax · Attention Is All You Need · Multi-Head Attention · Dense Connections · Residual Connection · Layer Normalization · Vision Transformer · Focus · self-DIstillation with NO labels