Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model
Alaa Dalaq, Muzammil Behzad

TL;DR
SegVLM is a vision-language model for referring image segmentation. It combines deformable convolutions, squeeze-and-excitation (SE) blocks, residual connections, and a new referring-aware fusion (RAF) loss to improve segmentation accuracy and generalization.
Contribution
The paper introduces SegVLM, a new model with architectural innovations and a referring-aware fusion loss for better cross-modal alignment and segmentation performance.
Findings
Ablation studies indicate that each component contributes to segmentation accuracy.
The model generalizes well across datasets.
The model achieves state-of-the-art results in referring segmentation.
Abstract
Image segmentation is a fundamental task in computer vision, aimed at partitioning an image into semantically meaningful regions. Referring image segmentation extends this task by using natural language expressions to localize specific objects, requiring effective integration of visual and linguistic information. In this work, we propose SegVLM, a vision-language model that incorporates architectural improvements to enhance segmentation accuracy and cross-modal alignment. The model integrates squeeze-and-excitation (SE) blocks for dynamic feature recalibration, deformable convolutions for geometric adaptability, and residual connections for deep feature learning. We also introduce a novel referring-aware fusion (RAF) loss that balances region-level alignment, boundary precision, and class imbalance. Extensive experiments and ablation studies demonstrate that each component contributes…
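To make the "dynamic feature recalibration" mentioned in the abstract concrete, here is a minimal sketch of a generic squeeze-and-excitation block in NumPy. It illustrates the standard SE idea only and is not taken from the SegVLM implementation: the function name `se_block`, the reduction ratio `r`, the weight shapes, and the random inputs are all assumptions chosen for illustration.

```python
import numpy as np

def se_block(x, w1, w2):
    """Generic squeeze-and-excitation over a single feature map.

    x  : array of shape (C, H, W), the input feature map
    w1 : array of shape (C // r, C), bottleneck weights (hypothetical)
    w2 : array of shape (C, C // r), expansion weights (hypothetical)
    Returns the recalibrated feature map and the per-channel gates.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = x.mean(axis=(1, 2))
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    h = np.maximum(w1 @ z, 0.0)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ h)))
    # Recalibration: scale every channel by its gate via broadcasting.
    return x * gates[:, None, None], gates

# Illustrative sizes and random weights (assumptions, not values from the paper).
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

y, gates = se_block(x, w1, w2)
```

Each gate lies strictly between 0 and 1, so the block can only attenuate a channel relative to the others; in a trained network the weights would be learned so that informative channels keep gates close to 1.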
Taxonomy
Topics: Multimodal Machine Learning Applications
