# SEAL: Structure and Element Aware Learning to Improve Long Structured Document Retrieval

**Authors:** Xinhao Huang, Zhibo Ren, Yipeng Yu, Ying Zhou, Zulong Chen, Zeyi Wen

arXiv: 2508.20778 · 2025-09-03

## TL;DR

This paper introduces SEAL, a structure-aware contrastive learning framework that leverages document structure and element semantics to significantly improve long structured document retrieval performance.

## Contribution

The paper proposes a novel contrastive learning approach that incorporates structural features and element-level semantics, along with a new dataset with structural annotations.

## Key findings

- Improved retrieval performance across multiple datasets and PLMs.
- Raised NDCG@10 on BGE-M3 from 73.96% to 77.84%.
- Demonstrated effectiveness through extensive experiments and online A/B testing.
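
For readers unfamiliar with the metric behind the headline result, the following is a minimal sketch of how NDCG@10 can be computed for a single query. It is not taken from the paper: the function names `dcg_at_k` and `ndcg_at_k` are illustrative, the gain is assumed to be linear in the relevance label (some evaluations use `2**rel - 1` instead), and the ranked list is assumed to contain every judged document for the query.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    # Rank 0 is the top result, so the discount is log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents at all for this query
    return dcg_at_k(relevances, k) / ideal


# Relevance labels of the retrieved documents, in ranked order (hypothetical data).
ranking = [0, 1, 0, 0, 1]
score = ndcg_at_k(ranking, k=10)
```

A benchmark-level score is then typically the mean of this per-query value over all queries.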

## Abstract

In long structured document retrieval, existing methods typically fine-tune pre-trained language models (PLMs) using contrastive learning on datasets lacking explicit structural information. This practice suffers from two critical issues: 1) current methods fail to leverage structural features and element-level semantics effectively, and 2) datasets containing structural metadata are lacking. To bridge these gaps, we propose SEAL, a novel contrastive learning framework. It leverages structure-aware learning to preserve semantic hierarchies and masked element alignment for fine-grained semantic discrimination. Furthermore, we release a long structured document retrieval dataset with rich structural annotations. Extensive experiments on both released and industrial datasets across various modern PLMs, along with online A/B testing, demonstrate consistent performance improvements, boosting NDCG@10 from 73.96% to 77.84% on BGE-M3. The resources are available at https://github.com/xinhaoH/SEAL.
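
As background for the contrastive fine-tuning the abstract refers to, here is a minimal sketch of a generic in-batch contrastive (InfoNCE-style) loss over query and document embeddings. It illustrates the baseline setup only and is not the paper's structure-aware or masked element alignment objective. The function names, the use of cosine similarity, and the temperature of 0.05 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, docs, temperature=0.05):
    """Mean in-batch contrastive loss.

    queries[i] is paired with docs[i]; every other document in the batch
    serves as a negative for that query.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, doc) / temperature for doc in docs]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the positive document.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy embeddings (hypothetical): each query points the same way as its document.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(queries, docs)
swapped_loss = info_nce_loss(queries, docs[::-1])
```

The loss is near zero when each query is closest to its own document and grows when the pairing is wrong, which is the signal a PLM encoder would be trained on in this kind of setup.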

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20778/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20778/full.md

## References

40 references — full list in the complete paper: https://tomesphere.com/paper/2508.20778/full.md

---
Source: https://tomesphere.com/paper/2508.20778