TL;DR
TextRegion combines frozen image-text models with SAM2 segmentation to produce detailed, text-aligned region tokens, improving open-vocabulary visual understanding without additional training.
Contribution
We introduce TextRegion, a training-free framework that combines image-text models with SAM2 to produce detailed, text-aligned, open-vocabulary region tokens applicable to various visual tasks.
Findings
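Achieves superior or competitive performance compared to state-of-the-art training-free methods on open-world semantic segmentation, referring expression comprehension, and grounding.
Is compatible with many image-text models.
Requires no additional training.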
Abstract
Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it…
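The idea of pairing patch-level features from an image-text model with segmentation masks can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes plain mask-average pooling followed by cosine similarity against text embeddings, and the function names `region_tokens` and `classify_regions`, the array shapes, and the toy inputs are all hypothetical. In practice the patch features would come from a frozen image-text model and the masks from SAM2, resized to the patch grid.

```python
import numpy as np


def region_tokens(patch_feats, masks):
    """Pool patch features inside each mask (assumed: simple mask averaging).

    patch_feats: (H, W, D) float array of patch embeddings.
    masks:       (R, H, W) boolean array, one mask per region.
    Returns an (R, D) array of L2-normalized region tokens.
    """
    num_regions = masks.shape[0]
    dim = patch_feats.shape[-1]
    flat_feats = patch_feats.reshape(-1, dim)                          # (H*W, D)
    flat_masks = masks.reshape(num_regions, -1).astype(flat_feats.dtype)  # (R, H*W)
    area = flat_masks.sum(axis=1, keepdims=True)                       # patches per region
    pooled = (flat_masks @ flat_feats) / np.maximum(area, 1.0)         # mean over the mask
    norm = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norm, 1e-8)


def classify_regions(tokens, text_embeds):
    """Assign each region the index of the most similar text embedding."""
    text = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (tokens @ text.T).argmax(axis=1)


# Toy example: a 2x2 patch grid whose left and right columns differ.
feats = np.zeros((2, 2, 2))
feats[:, 0, 0] = 1.0   # left column points along dimension 0
feats[:, 1, 1] = 1.0   # right column points along dimension 1
masks = np.zeros((2, 2, 2), dtype=bool)
masks[0, :, 0] = True  # region 0 covers the left column
masks[1, :, 1] = True  # region 1 covers the right column
texts = np.array([[2.0, 0.0], [0.0, 3.0]])  # stand-ins for two class-name embeddings

labels = classify_regions(region_tokens(feats, masks), texts)
print(labels.tolist())  # [0, 1]
```

Because the text embeddings are supplied at query time, the label set can change without retraining either model, which is the sense in which region tokens of this kind stay open-vocabulary.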
