Cross-Domain Semantic Segmentation with Large Language Model-Assisted Descriptor Generation
Philip Hughes, Larry Burns, Luke Adams

TL;DR
LangSeg uses large language models to generate fine-grained descriptors that improve semantic segmentation across diverse scenes without extensive retraining.
Contribution
The paper presents LangSeg, an LLM-guided segmentation framework that combines LLM-generated subclass descriptors with a pre-trained Vision Transformer to improve the accuracy and flexibility of semantic segmentation.
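The summary does not say how the descriptors are combined with the Vision Transformer features. As a rough illustration only, one common way to use text descriptors for labeling is nearest-descriptor matching in a shared embedding space. The sketch below assumes that setup: the function names, the toy 3-dimensional vectors, and the max-over-descriptors scoring rule are all hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_patches(
    patch_embeddings: List[Vector],
    descriptor_embeddings: Dict[str, List[Vector]],
) -> List[str]:
    """Assign each patch the class whose best-matching descriptor is most similar.

    Assumption: patch and descriptor embeddings already live in one shared
    space. In a real system they would come from an image encoder and a text
    encoder; here they are hand-written toy vectors.
    """
    labels = []
    for patch in patch_embeddings:
        best_class = max(
            descriptor_embeddings,
            key=lambda name: max(
                cosine(patch, d) for d in descriptor_embeddings[name]
            ),
        )
        labels.append(best_class)
    return labels


# Hypothetical embeddings of subclass descriptors, two per class.
descriptors = {
    "chair": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "table": [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
}
# Hypothetical embeddings of two image patches.
patches = [[0.8, 0.2, 0.0], [0.1, 0.7, 0.2]]

labels = label_patches(patches, descriptors)
print(labels)
```

Scoring a class by its best-matching descriptor lets several fine-grained descriptions of the same class each cover a different visual appearance, which is one plausible reason to generate subclass descriptors in the first place.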
Findings
Achieves up to 6.1% improvement in mIoU on ADE20K and COCO-Stuff datasets.
Outperforms state-of-the-art semantic segmentation models on both datasets.
Validated through ablation and human evaluation studies.
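For readers unfamiliar with the metric reported above, mIoU (mean intersection over union) averages, over classes, the ratio of correctly labeled pixels to all pixels that either the prediction or the ground truth assigns to that class. The following is a minimal sketch for a single image, assuming integer class labels and no ignored label; benchmark implementations typically accumulate counts over the whole dataset and handle an ignore index, and the function name `mean_iou` is chosen here for illustration.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    num_classes: int,
) -> float:
    """Mean IoU over the classes that appear in the prediction or the target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel counts once toward both intersection and union.
                intersection[p] += 1
                union[p] += 1
            else:
                # Pixel belongs to the union of both classes involved.
                union[p] += 1
                union[t] += 1
    # Skip classes absent from both maps so they do not distort the mean.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x2 label maps with two classes.
prediction = [[0, 0], [1, 1]]
ground_truth = [[0, 1], [1, 1]]
score = mean_iou(prediction, ground_truth, num_classes=2)
print(f"mIoU = {score:.4f}")
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. The summary does not state whether the reported 6.1% gain is in absolute percentage points or relative to a baseline.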
Abstract
Semantic segmentation plays a crucial role in enabling machines to understand and interpret visual scenes at a pixel level. While traditional segmentation methods have achieved remarkable success, their generalization to diverse scenes and unseen object categories remains limited. Recent advancements in large language models (LLMs) offer a promising avenue for bridging visual and textual modalities, providing a deeper understanding of semantic relationships. In this paper, we propose LangSeg, a novel LLM-guided semantic segmentation method that leverages context-sensitive, fine-grained subclass descriptors generated by LLMs. Our framework integrates these descriptors with a pre-trained Vision Transformer (ViT) to achieve superior segmentation performance without extensive model retraining. We evaluate LangSeg on two challenging datasets, ADE20K and COCO-Stuff, where it outperforms…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Web Data Mining and Analysis
Methods: Attention Is All You Need · Softmax · Adam · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Vision Transformer · Multi-Head Attention
