Latent Expression Generation for Referring Image Segmentation and Grounding
Seonghoon Yu, Junbeom Hong, Joonseok Lee, Jeany Son

TL;DR
This paper introduces a visual grounding framework that generates multiple latent expressions from a single textual description, capturing visual details the text leaves out and improving localization accuracy on RIS and REC tasks.
Contribution
It proposes a method that creates multiple latent expressions using a subject distributor module and a visual concept injector module, improving visual grounding performance.
Findings
Outperforms state-of-the-art RIS and REC methods on multiple benchmarks.
Achieves superior results on the generalized referring expression segmentation (GRES) benchmark.
Effectively captures diverse visual attributes through latent expression generation.
Abstract
Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in multiple ways, reflecting diverse attributes such as color, position, and more. However, most existing methods rely on a single textual input, which captures only a fraction of the rich information available in the visual domain. This mismatch between rich visual details and sparse textual cues can lead to the misidentification of similar objects. To address this, we propose a novel visual grounding framework that leverages multiple latent expressions generated from a single textual input by incorporating complementary visual details absent from the original description. Specifically, we introduce subject distributor and visual concept injector modules to…
