Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding

Byeongju Woo; Zilin Wang; Byeonghyun Pak; Sangwoo Mo; Stella X. Yu

arXiv:2602.02977·cs.CV·May 14, 2026

TL;DR

This paper introduces CAFT, a hierarchical vision-language model that improves understanding of long, detailed captions by aligning local scene parts with text, achieving state-of-the-art results in long-text retrieval.

Contribution

CAFT is the first model to jointly learn local part-text and global image-text alignment for detailed scene understanding without explicit region supervision.

Findings

01. CAFT outperforms previous models on six long-text retrieval benchmarks.

02. It learns fine-grained, localized semantic representations without explicit supervision.

03. The model demonstrates strong scaling behavior with large-scale training.

Abstract

Vision-language models such as CLIP often struggle to faithfully understand long, detail-rich captions, relying on dominant scene cues while overlooking fine-grained visual evidence. We propose a hierarchical vision-language learning principle for understanding scenes as part-to-whole compositions: before forming a whole-scene representation, a model should uncover what semantic parts appear where in the image. To this end, we propose CAFT (Cross-domain Alignment of Forests and Trees), a vision-language model that jointly learns local text-region alignment at intermediate representations and global image-text alignment at the final representation. Exploiting the organization of long captions, where local descriptions often correspond to scene parts, CAFT employs a fine-to-coarse image encoder and a part-whole text encoder to discover localized part semantics and progressively compose…
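
To make the idea of jointly learning local and global alignment concrete, here is a minimal NumPy sketch. It is not CAFT's actual objective or architecture, which the truncated abstract does not specify. As a stand-in for local text-region alignment without region labels, it uses a generic late-interaction heuristic: each caption sentence is scored against its best-matching image region, and the scores are averaged. The function names, the `local_weight` parameter, the temperature value, and the tensor shapes are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale vectors along the last axis to unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive(scores, temperature=0.07):
    # CLIP-style symmetric InfoNCE over a (B, B) score matrix whose
    # diagonal entries are the matched image-caption pairs.
    logits = scores / temperature
    idx = np.arange(scores.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)

def global_scores(image_emb, text_emb):
    # Cosine similarity between whole-image and whole-caption embeddings.
    # image_emb, text_emb: (B, D) -> (B, B)
    return l2_normalize(image_emb) @ l2_normalize(text_emb).T

def local_scores(region_emb, sentence_emb):
    # Assumed stand-in for local alignment (not the paper's mechanism):
    # for image i and caption j, match every sentence to its most
    # similar region, then average over sentences.
    # region_emb: (B, R, D), sentence_emb: (B, S, D) -> (B, B)
    regions = l2_normalize(region_emb)
    sentences = l2_normalize(sentence_emb)
    sim = np.einsum("ird,jsd->ijrs", regions, sentences)  # (B, B, R, S)
    return sim.max(axis=2).mean(axis=2)

def joint_alignment_loss(image_emb, text_emb, region_emb, sentence_emb,
                         local_weight=1.0):
    # Global image-text term plus a weighted local part-text term.
    # local_weight is a hypothetical hyperparameter.
    global_term = symmetric_contrastive(global_scores(image_emb, text_emb))
    local_term = symmetric_contrastive(local_scores(region_emb, sentence_emb))
    return global_term + local_weight * local_term

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, n_regions, n_sentences, dim = 4, 5, 3, 16
    loss = joint_alignment_loss(
        rng.normal(size=(batch, dim)),
        rng.normal(size=(batch, dim)),
        rng.normal(size=(batch, n_regions, dim)),
        rng.normal(size=(batch, n_sentences, dim)),
    )
    print(f"joint loss on random embeddings: {loss:.3f}")
```

In this sketch the local term only needs sentence-level embeddings from the caption and region-level embeddings from an intermediate layer of the image encoder. No region annotations are required, because the max over regions picks the match implicitly.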
