M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models

Joongmin Shin, Jeongbae Park, Jaehyung Seo, and Heuiseok Lim

arXiv:2605.18774·cs.IR·May 20, 2026

TL;DR

M3DocDep is a novel LVLM-based pipeline that improves chunking of long, multi-page documents by recovering document dependencies, leading to better retrieval and question-answering performance.

Contribution

The paper introduces a dependency-aware chunking method that captures cross-page relations and multimodal cues, improving retrieval and QA on long documents.

Findings

01. Improves STEDS by +28.5 to +39.6% on DHP benchmarks.

02. Enhances retrieval nDCG by +1.1 to +15.3%.

03. Boosts QA ANLS by +4.5 to +15.3%.

Abstract

In long, multi-page industrial documents, retrieval-augmented generation (RAG) depends heavily on whether chunk boundaries follow the document's true structure. Existing text-centric chunkers and generative hierarchy parsers often miss cross-page parent-child relations, figure/table-caption bindings, and boundary cues, which leads to fragmented or redundant chunks and degrades both retrieval and answer quality. We propose M3DocDep, an LVLM-based pipeline that first recovers block-level dependencies and then constructs chunks along the recovered document tree. The pipeline uses SharedDet as a common DP+OCR preprocessing layer, extracts multimodal block embeddings with boundary-aware SoftROI pooling, scores candidate parent-child edges with a biaffine head, decodes a globally valid dependency tree with MST constraints, and builds tree-guided chunks annotated with section paths and page…
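The score-then-decode-then-chunk flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`biaffine_score`, `decode_tree`, `section_path`), the toy 2-dimensional embeddings, and the weights are all hypothetical, and SoftROI pooling is not modeled. The decoder also swaps in a simpler technique than MST decoding: it assumes every block attaches to an earlier block in reading order, so a per-child argmax is already acyclic.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def biaffine_score(child: Vector, parent: Vector, U: Matrix, w: Vector, b: float) -> float:
    """Score one candidate edge: child^T U parent + w . [child; parent] + b."""
    bilinear = sum(
        child[i] * U[i][j] * parent[j]
        for i in range(len(child))
        for j in range(len(parent))
    )
    linear = sum(wk * xk for wk, xk in zip(w, list(child) + list(parent)))
    return bilinear + linear + b


def decode_tree(embeddings: Sequence[Vector], U: Matrix, w: Vector, b: float) -> List[int]:
    """Pick a parent for every block; index 0 is a virtual ROOT with parent -1.

    Simplification (NOT MST decoding): a block may only attach to an earlier
    block in reading order, so the per-child argmax can never create a cycle.
    """
    parents = [-1]
    for c in range(1, len(embeddings)):
        best = max(
            range(c),
            key=lambda p: biaffine_score(embeddings[c], embeddings[p], U, w, b),
        )
        parents.append(best)
    return parents


def section_path(node: int, parents: Sequence[int], labels: Sequence[str]) -> List[str]:
    """Walk up toward ROOT and return ancestor labels, outermost first."""
    path = []
    cur = parents[node]
    while cur > 0:  # stop before the virtual ROOT at index 0
        path.append(labels[cur])
        cur = parents[cur]
    return path[::-1]


# Toy document (all values are made up for illustration).
labels = ["ROOT", "1 Intro", "intro paragraph", "2 Method", "2.1 Setup", "setup paragraph"]
embeddings = [[1, 1], [2, 0], [3, -1], [0, 2], [-1, 3], [-2, 3]]
U = [[1, 0], [0, 1]]   # identity: the bilinear term reduces to a dot product
w = [0, 0, 0, 0]       # linear term switched off in this toy setup
b = 0.0

parents = decode_tree(embeddings, U, w, b)
chunk_meta = {labels[i]: section_path(i, parents, labels) for i in range(1, len(labels))}
```

In this toy setup each paragraph attaches to its nearest matching heading, and `chunk_meta` maps every block to the section path a tree-guided chunk could carry as an annotation. A full MST decoder would drop the reading-order restriction and enforce tree validity globally instead.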
