Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

Weiming Zhang; Dingwen Xiao; Songyue Guo; Guangyu Xiang; Shiqi Wen; Minwei Zhao; Lei Chen; Lin Wang

arXiv:2604.07916·cs.CV·April 10, 2026

TL;DR

Tarot-SAM3 is a training-free framework that enhances the Segment Anything Model 3 for accurate referring expression segmentation, handling both explicit and implicit expressions without additional training.

Contribution

It introduces a novel two-phase approach with reasoning-assisted prompts and self-refinement, enabling SAM3 to generalize to any referring expression without training.
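
The two-phase idea can be sketched as a control flow. This is only an illustrative outline, not the authors' implementation: the function name `two_phase_res`, the three injected callables (`propose_prompts`, `segment`, `refine`), the `max_rounds` parameter, and the stop-when-unchanged rule are all assumptions made for this example. No real SAM3 or MLLM API is used; masks are toy sets of pixel coordinates.

```python
from typing import Callable, List, Sequence, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]  # toy mask: a set of (row, col) pixels


def two_phase_res(
    image: object,
    expression: str,
    propose_prompts: Callable[[object, str], Sequence[str]],
    segment: Callable[[object, str], Mask],
    refine: Callable[[object, str, List[Mask]], List[Mask]],
    max_rounds: int = 2,
) -> List[Mask]:
    """Hypothetical two-phase loop: prompt generation, then self-refinement."""
    # Phase 1 (assumed): a reasoning step rewrites the expression into
    # short prompts, and the segmenter is run once per prompt.
    prompts = propose_prompts(image, expression)
    masks = [segment(image, prompt) for prompt in prompts]

    # Phase 2 (assumed): candidate masks are re-checked against the
    # original expression until they stop changing or rounds run out.
    for _ in range(max_rounds):
        refined = refine(image, expression, masks)
        if refined == masks:
            break
        masks = refined
    return masks
```

In a real system the three callables would wrap an MLLM and the segmentation model; passing them in as arguments keeps the sketch independent of any particular library.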

Findings

01. Achieves strong performance on RES benchmarks and in open-world scenarios.

02. Effectively handles both explicit and implicit referring expressions.

03. Validates its design through extensive ablation studies.

Abstract

Referring Expression Segmentation (RES) aims to segment image regions described by natural-language expressions, serving as a bridge between vision and language understanding. Existing RES methods, however, rely heavily on large annotated datasets and are limited to either explicit or implicit expressions, hindering their ability to generalize to any referring expression. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Nonetheless, applying it to RES remains challenging: (1) SAM3 struggles with longer or implicit expressions; (2) naive coupling of SAM3 with a multimodal large language model (MLLM) makes the final results overly dependent on the MLLM's reasoning capability, without enabling refinement of SAM3's segmentation outputs. To this end, we present Tarot-SAM3, a novel training-free framework that can accurately…
