YOLOE-26: Integrating YOLO26 with YOLOE for Real-Time Open-Vocabulary Instance Segmentation
Ranjan Sapkota, Manoj Karkee

TL;DR
YOLOE-26 is a real-time, open-vocabulary instance segmentation framework that combines YOLO26's efficiency with YOLOE's open-vocabulary learning, enabling segmentation driven by text prompts, visual prompts, or a built-in vocabulary in real-world scenarios.
Contribution
It introduces a novel architecture integrating open-vocabulary learning with YOLO26, including a unified embedding space and multiple prompt modalities for real-time segmentation.
Findings
Consistent scaling and accuracy-efficiency trade-offs are demonstrated across models.
Effective zero-shot and prompt-based segmentation in real-time.
Compatible with large-scale detection and grounding datasets.
Abstract
This paper presents YOLOE-26, a unified framework that integrates the deployment-optimized YOLO26 (or YOLOv26) architecture with the open-vocabulary learning paradigm of YOLOE for real-time open-vocabulary instance segmentation. Building on the NMS-free, end-to-end design of YOLOv26, the proposed approach preserves the hallmark efficiency and determinism of the YOLO family while extending its capabilities beyond closed-set recognition. YOLOE-26 employs a convolutional backbone with PAN/FPN-style multi-scale feature aggregation, followed by end-to-end regression and instance segmentation heads. A key architectural contribution is the replacement of fixed class logits with an object embedding head, which formulates classification as similarity matching against prompt embeddings derived from text descriptions, visual examples, or a built-in vocabulary. To enable efficient open-vocabulary…
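The idea of classifying by similarity matching, rather than by fixed class logits, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the L2 normalization, and the toy 3-dimensional embeddings are all assumptions made for illustration. In a real system the object embeddings would come from the detection head and the prompt embeddings from a text or visual encoder.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (assumed normalization step)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

def similarity_scores(object_embeddings, prompt_embeddings):
    """Cosine similarity between every object and every prompt embedding."""
    objs = [l2_normalize(o) for o in object_embeddings]
    prompts = [l2_normalize(p) for p in prompt_embeddings]
    return [[sum(a * b for a, b in zip(o, p)) for p in prompts] for o in objs]

def classify(object_embeddings, prompt_embeddings):
    """Assign each object the index of its most similar prompt."""
    scores = similarity_scores(object_embeddings, prompt_embeddings)
    labels = [max(range(len(row)), key=row.__getitem__) for row in scores]
    return labels, scores

# Hypothetical toy data: two prompts and two detected objects.
prompt_names = ["person", "dog"]
prompt_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
object_embeddings = [[0.9, 0.1, 0.0], [0.2, 2.0, 0.1]]

labels, scores = classify(object_embeddings, prompt_embeddings)
predicted = [prompt_names[i] for i in labels]
print(predicted)
```

Because the class set is defined only by the rows of `prompt_embeddings`, changing the vocabulary in this sketch means supplying different prompt vectors; nothing about the scoring function itself depends on a fixed number of classes.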
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
