HybriDLA: Hybrid Generation for Document Layout Analysis
Yufan Chen, Omar Moured, Ruiping Liu, Junwei Zheng, Kunyu Peng, Jiaming Zhang, Rainer Stiefelhagen

TL;DR
HybriDLA is a novel generative framework combining diffusion and autoregressive decoding to improve document layout analysis, especially for complex and diverse modern documents, achieving state-of-the-art performance.
Contribution
The paper introduces HybriDLA, a unified generative model that integrates diffusion and autoregressive decoding for enhanced document layout analysis.
Findings
Achieves 83.5% mAP on benchmark datasets.
Outperforms previous state-of-the-art methods.
Effectively handles complex and diverse document layouts.
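The mAP figure above rests on intersection-over-union (IoU) between predicted and ground-truth regions: a prediction counts as correct only when its IoU with a matching ground-truth box clears a threshold. A minimal sketch of that overlap measure follows. The function name `box_iou` and the `(x1, y1, x2, y2)` box format are assumptions for illustration and are not taken from the paper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])

    # Clamp to zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs yield 0.0 rather than dividing by zero.
    return inter / union if union > 0 else 0.0
```

Averaging precision over recall levels, classes, and (in COCO-style evaluation) several IoU thresholds then gives a single mAP number.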
Abstract
Conventional document layout analysis (DLA) depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address the challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level…
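The idea of iteratively refining bounding-box hypotheses can be illustrated with a toy loop. This is only a sketch under stated assumptions, not HybriDLA's actual decoder: `refine_boxes`, `denoise_fn`, the step count, and the `1 / t` step size are all hypothetical, and the autoregressive component described in the abstract is not modeled here.

```python
import random

def refine_boxes(noisy_boxes, denoise_fn, steps=10):
    """Toy iterative refinement of box hypotheses.

    noisy_boxes: list of [x1, y1, x2, y2] starting hypotheses.
    denoise_fn:  hypothetical stand-in for a learned model; given the
                 current boxes and step index t, returns its estimate
                 of the clean boxes.
    """
    boxes = [list(b) for b in noisy_boxes]
    for t in range(steps, 0, -1):
        predicted = denoise_fn(boxes, t)
        # Assumed schedule: move a fraction 1/t toward the estimate,
        # so the last step (t == 1) lands on the model's prediction.
        alpha = 1.0 / t
        boxes = [
            [c + alpha * (p - c) for c, p in zip(box, pred)]
            for box, pred in zip(boxes, predicted)
        ]
    return boxes

# Usage with an oracle "denoiser" that always returns the true boxes.
# A real model would predict them from image features instead.
target = [[10.0, 20.0, 110.0, 80.0], [15.0, 100.0, 200.0, 300.0]]
rng = random.Random(0)
start = [[rng.uniform(0.0, 300.0) for _ in range(4)] for _ in target]
refined = refine_boxes(start, lambda boxes, t: target, steps=5)
```

With the oracle stand-in, `refined` converges to `target`; the point is only to show the shape of a multi-step refinement loop, in contrast to a single forward pass over fixed queries.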
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
