M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
Joongmin Shin, Jeongbae Park, Jaehyung Seo, and Heuiseok Lim

TL;DR
M3DocDep is a novel LVLM-based pipeline that improves chunking of long, multi-page documents by recovering document dependencies, leading to better retrieval and question-answering performance.
Contribution
It introduces a dependency-aware chunking method that captures cross-page relations and multimodal cues, enhancing retrieval and QA in long documents.
Findings
Improves STEDS by 28.5 to 39.6% on DHP benchmarks.
Improves retrieval nDCG by 1.1 to 15.3%.
Improves QA ANLS by 4.5 to 15.3%.
Abstract
In long, multi-page industrial documents, retrieval-augmented generation (RAG) depends heavily on whether chunk boundaries follow the document's true structure. Existing text-centric chunkers and generative hierarchy parsers often miss cross-page parent-child relations, figure/table-caption bindings, and boundary cues, which leads to fragmented or redundant chunks and degrades both retrieval and answer quality. We propose M3DocDep, an LVLM-based pipeline that first recovers block-level dependencies and then constructs chunks along the recovered document tree. The pipeline uses SharedDet as a common DP+OCR preprocessing layer, extracts multimodal block embeddings with boundary-aware SoftROI pooling, scores candidate parent-child edges with a biaffine head, decodes a globally valid dependency tree with MST constraints, and builds tree-guided chunks annotated with section paths and page…
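The scoring and decoding steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a plain bilinear form for the biaffine head, and it assumes a parent block always precedes its child in reading order. Under that second assumption, choosing the best-scoring earlier block for each child cannot create a cycle, so a simple per-child argmax stands in for a general MST decoder. All function names, the virtual ROOT block, and the toy scores are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def biaffine_score(child: Vector, parent: Vector, U: Matrix, b: float = 0.0) -> float:
    """Return child^T U parent + b (bilinear term plus a scalar bias)."""
    return b + sum(
        child[i] * U[i][j] * parent[j]
        for i in range(len(child))
        for j in range(len(parent))
    )


def decode_tree(scores: List[List[float]]) -> List[int]:
    """Pick one parent per block; index 0 is a virtual ROOT.

    scores[c][p] is the score of the edge p -> c. Assumption: only p < c is
    allowed (parents come earlier in reading order), so no cycle can form
    and the per-child argmax yields the best-scoring tree.
    """
    parents = [-1]  # ROOT has no parent
    for c in range(1, len(scores)):
        parents.append(max(range(c), key=lambda p: scores[c][p]))
    return parents


def section_path(node: int, parents: List[int], titles: List[str]) -> List[str]:
    """Collect ancestor titles from the top-level section down to the parent."""
    path = []
    p = parents[node]
    while p > 0:  # stop at ROOT (index 0)
        path.append(titles[p])
        p = parents[p]
    return path[::-1]


# Toy document: a heading, a paragraph, a sub-heading, and a table that
# sits on the following page but belongs to the sub-heading.
titles = ["ROOT", "1 Safety", "intro paragraph", "1.1 Wiring", "Table 2"]
scores = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 1.5, 0.0, 0.0, 0.0],
    [0.2, 1.8, 0.3, 0.0, 0.0],
    [0.0, 0.4, 0.1, 2.2, 0.0],
]
parents = decode_tree(scores)
table_path = section_path(4, parents, titles)
```

In this toy case the table is attached to "1.1 Wiring" and its section path is ["1 Safety", "1.1 Wiring"], which is the kind of annotation a tree-guided chunk could carry. If the reading-order assumption were dropped, a general maximum spanning arborescence algorithm such as Chu-Liu/Edmonds would be needed instead of the per-child argmax.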
