Moondream Segmentation: From Words to Masks

Ethan Reid

arXiv:2604.02593 · cs.CV · April 6, 2026

TL;DR

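Moondream Segmentation extends the Moondream 3 vision-language model to referring image segmentation: given an image and a referring expression, it autoregressively decodes a vector path, refines the rasterized mask into a detailed final mask, and uses a reinforcement learning stage to optimize mask quality directly.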

Contribution

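Introduces a referring segmentation method with a reinforcement learning stage that directly optimizes mask quality, and releases RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Reported results are 80.2% cIoU on RefCOCO (val) and 62.6% mIoU on LVIS (val).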

Findings

01

Achieves 80.2% cIoU on the RefCOCO validation set.

02

Attains 62.6% mIoU on the LVIS validation set.

03

Releases RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks, to reduce evaluation noise from polygon annotations.
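
The two metrics quoted above aggregate differently, which matters when comparing the numbers. The sketch below follows the conventional definitions used in referring segmentation work: cIoU divides the summed intersection by the summed union over the whole split, while mIoU averages per-sample IoU. It is an illustration, not the paper's evaluation code; the function names, the nested-list mask format, and the handling of empty unions are assumptions made here.

```python
def iou_counts(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks.

    Masks are nested lists of 0/1 with identical shape (an assumption
    of this sketch; real pipelines typically use arrays).
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def ciou(pairs):
    """Cumulative IoU: total intersection over total union across samples."""
    total_inter = 0
    total_union = 0
    for pred, gt in pairs:
        inter, union = iou_counts(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def miou(pairs):
    """Mean IoU: average of per-sample IoU values."""
    scores = []
    for pred, gt in pairs:
        inter, union = iou_counts(pred, gt)
        # Assumption: two empty masks count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores) if scores else 0.0


# A small object predicted half right, and a large object predicted exactly.
small_pred = [[1, 0], [0, 0]]
small_gt = [[1, 1], [0, 0]]
large_pred = [[1] * 4 for _ in range(4)]
large_gt = [[1] * 4 for _ in range(4)]
pairs = [(small_pred, small_gt), (large_pred, large_gt)]

print(ciou(pairs))  # 17/18: dominated by the large object
print(miou(pairs))  # 0.75: each sample weighted equally
```

Because cIoU pools pixels, large objects dominate it, whereas mIoU weights every sample equally. The RefCOCO and LVIS figures are therefore not directly comparable to each other.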

Abstract

We present Moondream Segmentation, a referring image segmentation extension of Moondream 3, a vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. We introduce a reinforcement learning stage that resolves ambiguity in the supervised signal by directly optimizing mask quality. Rollouts from this stage produce coarse-to-ground-truth targets for the refiner. To mitigate evaluation noise from polygon annotations, we release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Moondream Segmentation achieves a cIoU of 80.2% on RefCOCO (val) and 62.6% mIoU on LVIS (val).
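
To make the "vector path to rasterized mask" step concrete, here is a minimal sketch of rasterizing a closed polygonal path into a binary mask with the even-odd rule, sampled at pixel centers. It assumes the path is a plain list of (x, y) vertices; the abstract does not specify the model's actual path format, its tokenization, or the refiner, so none of those are represented. The name `rasterize_path` is hypothetical.

```python
def rasterize_path(points, width, height):
    """Fill a closed polygon into a height x width binary mask.

    Assumption of this sketch: `points` is a list of (x, y) vertices in
    pixel coordinates, implicitly closed (last vertex joins the first).
    A pixel is set when its center lies inside under the even-odd rule.
    """
    mask = [[0] * width for _ in range(height)]
    n = len(points)
    for y in range(height):
        cy = y + 0.5  # pixel-center y coordinate
        for x in range(width):
            cx = x + 0.5  # pixel-center x coordinate
            inside = False
            for k in range(n):
                x0, y0 = points[k]
                x1, y1 = points[(k + 1) % n]
                # Does this edge straddle the horizontal line through cy?
                if (y0 > cy) != (y1 > cy):
                    # x position where the edge crosses that line
                    x_cross = x0 + (cy - y0) * (x1 - x0) / (y1 - y0)
                    if cx < x_cross:
                        inside = not inside
            if inside:
                mask[y][x] = 1
    return mask


# A 2x2 square centered in a 4x4 image.
square = [(1, 1), (3, 1), (3, 3), (1, 3)]
for row in rasterize_path(square, 4, 4):
    print(row)
```

A mask produced this way has polygon-limited boundaries, which is consistent with the abstract's motivation for a separate refinement step and for replacing polygon annotations with boundary-accurate masks in RefCOCO-M.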
