Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

Xuanyu Zhu; Yan Bai; Yang Shi; Yihang Lou; Yuanxing Zhang; Jing Jin; Yuan Zhou

arXiv:2605.10780 · cs.CV · May 13, 2026

TL;DR

This paper introduces DRoRAE, a multi-layer feature fusion method for vision autoencoders that enhances image reconstruction and generation by leveraging hierarchical information across encoder layers.

Contribution

It proposes a novel fusion module with adaptive routing and a decoupled training strategy to improve visual tokenization by utilizing multi-layer features.

Findings

01. DRoRAE reduces reconstruction FID (rFID) from 0.57 to 0.29 on ImageNet-256.

02. It improves generation FID from 1.74 to 1.65 with AutoGuidance.

03. It uncovers a log-linear scaling law between fusion capacity and reconstruction quality.
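
As a rough illustration of what a log-linear scaling law could look like here, the expression below is one plausible reading, not the paper's stated formula. The symbols are assumptions introduced for illustration: C is a scalar measure of fusion capacity (for example, the parameter count of the fusion module), and α and β > 0 are fitted constants whose values this page does not give.

```latex
% Hypothetical form: reconstruction error falls linearly in the logarithm of capacity.
% C, \alpha, and \beta are illustrative symbols, not values reported on this page.
\mathrm{rFID}(C) \approx \alpha - \beta \log C, \qquad \beta > 0
```

Under this reading, each multiplicative increase in capacity would buy a roughly constant additive drop in rFID.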

Abstract

Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the…
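
The fusion idea in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `fuse_layers`, the parameters `router_w`, `proj_w`, and `energy_budget`, the softmax-over-layers routing, and the norm cap used as the "energy constraint" are all hypothetical choices. The only parts taken from the abstract are that all encoder layers are aggregated adaptively and that the result is an incremental correction to the last-layer latent.

```python
import numpy as np

def fuse_layers(layer_feats, router_w, proj_w, energy_budget=0.1):
    """Fuse per-layer features into a corrected last-layer latent.

    layer_feats: (L, N, D) hidden states from L encoder layers, N tokens, D dims.
    router_w:    (D,) hypothetical routing vector that scores each layer per token.
    proj_w:      (D, D) hypothetical projection applied to the correction.
    energy_budget: assumed cap on the correction norm, relative to the
                   last-layer norm of each token.
    """
    last = layer_feats[-1]                                  # (N, D)

    # Adaptive routing: a softmax over the layer axis, computed per token.
    scores = layer_feats @ router_w                         # (L, N)
    scores = scores - scores.max(axis=0, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)

    # Weighted aggregate of all layers, then the correction relative to the last layer.
    mixed = (weights[..., None] * layer_feats).sum(axis=0)  # (N, D)
    delta = (mixed - last) @ proj_w                         # (N, D)

    # Assumed energy constraint: rescale the correction so its norm never
    # exceeds energy_budget times the last-layer token norm.
    last_norm = np.linalg.norm(last, axis=-1, keepdims=True)
    delta_norm = np.linalg.norm(delta, axis=-1, keepdims=True)
    scale = np.minimum(1.0, energy_budget * last_norm / (delta_norm + 1e-8))

    # Incremental correction: the output stays close to the original latent.
    return last + scale * delta, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(12, 16, 32))   # 12 layers, 16 tokens, 32 dims
    fused, weights = fuse_layers(feats, rng.normal(size=32), rng.normal(size=(32, 32)))
    print(fused.shape, weights.shape)
```

Keeping the correction small relative to the last-layer latent is one way to stay compatible with a frozen decoder that was trained on last-layer features only; whether DRoRAE enforces its constraint this way is not stated in the text above.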

Code & Models

Repositories

zhuzil/DRoRAE (GitHub)
