Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation
Weiming Zhang, Dingwen Xiao, Songyue Guo, Guangyu Xiang, Shiqi Wen, Minwei Zhao, Lei Chen, Lin Wang

TL;DR
Tarot-SAM3 is a training-free framework that adapts the Segment Anything Model 3 (SAM3) to referring expression segmentation, accurately handling both explicit and implicit expressions.
Contribution
Tarot-SAM3 introduces a novel two-phase approach that combines reasoning-assisted prompts with self-refinement, enabling SAM3 to generalize to any referring expression without training.
Findings
Achieves strong performance on RES benchmarks and open-world scenarios.
Effectively handles both explicit and implicit referring expressions.
Validated through extensive ablation studies.
Abstract
Referring Expression Segmentation (RES) aims to segment image regions described by natural-language expressions, serving as a bridge between vision and language understanding. Existing RES methods, however, rely heavily on large annotated datasets and are limited to either explicit or implicit expressions, hindering their ability to generalize to any referring expression. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Nonetheless, applying it to RES remains challenging: (1) SAM3 struggles with longer or implicit expressions; (2) naive coupling of SAM3 with a multimodal large language model (MLLM) makes the final results overly dependent on the MLLM's reasoning capability, without enabling refinement of SAM3's segmentation outputs. To this end, we present Tarot-SAM3, a novel training-free framework that can accurately…
