TL;DR
T-REN is a lightweight, text-aligned region encoder that strengthens dense vision-language understanding and improves scalability by reducing token counts, while improving performance across multiple vision-language tasks.
Contribution
The paper proposes an efficient encoder that maps visual data to compact, text-aligned region tokens, improving dense cross-modal understanding with only 3.7% additional parameters relative to the vision-language backbone.
Findings
+5.9 mIoU on ADE20K segmentation
+18.4% recall on COCO retrieval
+15.6% recall on Ego4D localization
Abstract
Despite recent progress, vision-language encoders struggle with two core limitations: (1) weak alignment between language and dense vision features, which hurts tasks like open-vocabulary semantic segmentation; and (2) high token counts for fine-grained visual representations, which limits scalability to long videos. This work addresses both limitations. We propose T-REN (Text-aligned Region Encoder Network), an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (or region tokens). T-REN achieves this through a lightweight network added on top of a frozen vision backbone, trained to pool patch-level representations within each semantic region into region tokens and align them with region-level text annotations. With only 3.7% additional parameters compared to the vision-language backbone, this design yields substantially stronger dense…
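The pooling-and-alignment idea in the abstract can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: T-REN learns its pooling with a trained network, whereas the code below assumes region assignments are already given and substitutes plain mean pooling and cosine similarity. The function names (`pool_regions`, `label_regions`) and the two-dimensional toy features are hypothetical.

```python
import math

def pool_regions(patch_feats, region_ids):
    """Average the patch features that share a region id into one token per region.

    Assumption: region ids are given. T-REN learns its pooling instead.
    """
    sums, counts = {}, {}
    for feat, rid in zip(patch_feats, region_ids):
        acc = sums.setdefault(rid, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[rid] = counts.get(rid, 0) + 1
    return {rid: [v / counts[rid] for v in acc] for rid, acc in sums.items()}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def label_regions(region_tokens, text_embeds):
    """Assign each region token the text label with the highest cosine similarity."""
    return {
        rid: max(text_embeds, key=lambda name: cosine(token, text_embeds[name]))
        for rid, token in region_tokens.items()
    }

# Toy data (hypothetical): 4 patch features, 2 regions, 2 text embeddings.
patch_feats = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
region_ids = [0, 0, 1, 1]
text_embeds = {"sky": [1.0, 0.0], "grass": [0.0, 1.0]}

region_tokens = pool_regions(patch_feats, region_ids)  # 4 patches become 2 region tokens
labels = label_regions(region_tokens, text_embeds)
print(len(patch_feats), "patches ->", len(region_tokens), "region tokens:", labels)
```

In this toy setting the token count drops from one per patch to one per region, which is the source of the scalability benefit the abstract describes for long videos.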
