TL;DR
REN is a region encoding method that generates region representations from patch-based image encoders 60x faster and with 35x less memory than segmentation-based pipelines, while improving token quality and outperforming existing methods in segmentation and retrieval tasks.
Contribution
REN introduces a lightweight module that generates region tokens directly from point prompts, bypassing the segmentation step. Token generation is 60x faster and uses 35x less memory, with better token quality.
Findings
Outperforms original encoders in segmentation and retrieval
Achieves state-of-the-art on Ego4D VQ2D benchmark
Generates tokens 60x faster with 35x less memory than segmentation-based methods
Abstract
We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations using point prompts. Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60x faster token generation with 35x less memory, while also improving token quality. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders (DINO, DINOv2, and OpenCLIP) and show that it can be extended to other encoders…
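The cross-attention step described in the abstract can be illustrated with a minimal, dependency-free sketch: point prompts act as queries, patch features act as keys and values, and each query yields one region token. This is not the authors' implementation. The sinusoidal `embed_point` function, the single attention head, the random projection matrices, and the toy sizes (16 patches, feature dimension 8) are all assumptions made for illustration; the real model uses learned weights and several blocks.

```python
import math
import random


def embed_point(x, y, dim):
    """Hypothetical sinusoidal embedding of a point (x, y) in [0, 1]^2."""
    assert dim % 4 == 0, "dim must be a multiple of 4"
    feats = []
    for i in range(dim // 4):
        freq = (2 ** i) * math.pi
        feats += [math.sin(freq * x), math.cos(freq * x),
                  math.sin(freq * y), math.cos(freq * y)]
    return feats


def random_matrix(rows, cols, rng):
    """Stand-in for a learned projection matrix (assumption: random init)."""
    std = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, patches, Wq, Wk, Wv):
    """Single-head cross-attention: each query attends over all patches."""
    keys = [matvec(Wk, p) for p in patches]
    values = [matvec(Wv, p) for p in patches]
    scale = 1.0 / math.sqrt(len(keys[0]))
    tokens, weights = [], []
    for q in queries:
        qp = matvec(Wq, q)
        scores = [scale * sum(a * b for a, b in zip(qp, k)) for k in keys]
        attn = softmax(scores)
        # Region token = attention-weighted sum of the value vectors.
        token = [sum(a * v[j] for a, v in zip(attn, values))
                 for j in range(len(values[0]))]
        tokens.append(token)
        weights.append(attn)
    return tokens, weights


rng = random.Random(0)
dim = 8
# Toy stand-in for patch features from an image encoder (4x4 grid of patches).
patches = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(16)]
Wq, Wk, Wv = (random_matrix(dim, dim, rng) for _ in range(3))

points = [(0.25, 0.25), (0.75, 0.5)]  # two point prompts
queries = [embed_point(x, y, dim) for x, y in points]
region_tokens, attn = cross_attention(queries, patches, Wq, Wk, Wv)
print(len(region_tokens), len(region_tokens[0]))
```

Because the number of region tokens equals the number of point prompts rather than the number of patches, the output stays compact regardless of image resolution, and no segmentation mask is computed at any point.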
