T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

Savya Khosla; Sethuraman T V; Aryan Chadha; Alex Schwing; Derek Hoiem

arXiv:2604.18573·cs.CV·April 21, 2026

TL;DR

T-REN is a lightweight, text-aligned region encoder trained on top of a frozen vision backbone. It represents visual data with a compact set of region tokens, which reduces token counts and improves dense vision-language performance on segmentation (ADE20K), retrieval (COCO), and localization (Ego4D).

Contribution

An efficient encoder that maps visual data to a compact set of text-aligned region tokens, improving dense cross-modal understanding with only 3.7% additional parameters relative to the vision-language backbone.

Findings

01. +5.9 mIoU on ADE20K segmentation
02. +18.4% recall on COCO retrieval
03. +15.6% recall on Ego4D localization

Abstract

Despite recent progress, vision-language encoders struggle with two core limitations: (1) weak alignment between language and dense vision features, which hurts tasks like open-vocabulary semantic segmentation; and (2) high token counts for fine-grained visual representations, which limits scalability to long videos. This work addresses both limitations. We propose T-REN (Text-aligned Region Encoder Network), an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (or region tokens). T-REN achieves this through a lightweight network added on top of a frozen vision backbone, trained to pool patch-level representations within each semantic region into region tokens and align them with region-level text annotations. With only 3.7% additional parameters compared to the vision-language backbone, this design yields substantially stronger dense…
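The pooling-and-alignment idea in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names are hypothetical, plain mask averaging stands in for T-REN's learned lightweight pooling network, and the symmetric contrastive loss is a generic choice, since the abstract does not state which alignment objective the paper uses.

```python
import numpy as np

def pool_region_tokens(patch_feats, region_ids, num_regions):
    """Average patch features within each region (illustrative stand-in
    for a learned pooling network).

    patch_feats: (N, D) array of patch-level features from a frozen backbone.
    region_ids:  (N,) integer array assigning each patch to a region.
    Returns a (num_regions, D) array with one token per region.
    """
    dim = patch_feats.shape[1]
    sums = np.zeros((num_regions, dim), dtype=patch_feats.dtype)
    np.add.at(sums, region_ids, patch_feats)  # scatter-add patches into regions
    counts = np.bincount(region_ids, minlength=num_regions)
    counts = np.maximum(counts, 1).astype(patch_feats.dtype)  # avoid divide-by-zero
    return sums / counts[:, None]

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def region_text_similarity(region_tokens, text_embeds):
    """Cosine similarity between every region token and every text embedding."""
    return l2_normalize(region_tokens) @ l2_normalize(text_embeds).T

def contrastive_alignment_loss(region_tokens, text_embeds, temperature=0.07):
    """Symmetric cross-entropy over similarities, assuming region i is
    annotated with text i. A generic objective, not confirmed by the source."""
    logits = region_text_similarity(region_tokens, text_embeds) / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

In this sketch an image with N patches and R regions is represented by R tokens instead of N, which is the source of the token-count reduction the abstract describes. Open-vocabulary labeling would then amount to taking the argmax over each row of `region_text_similarity`.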

Code & Models

Repositories

savya08/T-REN (GitHub)

Models

savyak2/T-REN (Hugging Face model)
